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1  Introduction 

In  this  paper  we  address  the  problem  of  estimating  the  3D  structure  of  a  scene  and  the 
relative  motion  of  a  camera  with  respect  to  that  scene  from  a  sequence  of  images  acquired 
by  the  camera.  This  problem,  also  referred  to  as  the  structure  and  motion  estimation 
problem  has  been  studied  extensively  in  computer  vision  with  rather  limited  success  in  the 
general  case.  We  show  how  the  use  of  multiple  frames  allows  us  not  only  to  recover  both 
structure  and  motion  but  also  to  do  so  in  a  way  that  is  robust  and  practical  for  real  images. 

Knowledge  of  scene  structure  and  camera  motion  is  useful  for  the  navigation  of  au¬ 
tonomous  vehicles  on  land,  undersea  and  in  space.  Information  about  the  3D  structure  of 
the  scene  is  also  of  interest  for  the  localization  and  recognition  of  objects. 

Almost  all  exisiting  approaches  to  solving  the  structure  and  motion  estimation  problem 
have  at  least  one  of  the  following  two  properties: 

1.  They  use  the  optical  flow  field  or  feature  matches  as  input. 

2.  They  use  only  two  sequential  frames  out  of  an  image  sequence. 

Both  of  these  properties  represent  severe  limitations,  as  we  discuss  below.  The  purpose 
of  this  paper  is  to  show  how  both  limitations  can  be  overcome  to  produce  practical  and 
accurate  estimates  of  structure  and  motion. 

1.1  Feature  and  optical  flow  based  structure  and  motion  estimation 

Optical  flow  is  a  term  that  describes  computational  approximations  to  the  projection  of 
a  3D  velocity  field  into  the  image  plane  (see  section  1.4  for  details).  A  number  of  methods 
are  available  for  computing  this  vector  field  of  image  velocities:  Anandan  [3],  Horn  and 
Schunck  [26],  Hildreth  [24],  Nagel  and  Enkelmann  [37],  Heeger  [20]  and  many  others.  The 
computation  of  optical  flow  is  computationally  very  expensive  and  the  optical  flow  usually 
exhibits  significant  systematic  and  random  error. 

Features  are  locations  in  the  image  that  are  easily  identified  such  as  edges,  comers  or 
other  distinguished  points  in  the  brightness  array.  Feature-based  methods  first  determine 
such  locations  in  two  frames  and  then  match  features  between  frames.  Conceptually,  the 
optical  flow  field  is  a  set  of  dense  feature  matches,  one  match  for  each  pixel  in  the  image. 
Conversely,  one  can  view  feature  matching  methods  as  optical  flow  computation  at  a  few; 
selected  locations  where  the  probablity  of  obtaining  a  good  optical  flow  estimate  is  large. 
Therefore,  the  problems  we  encounter  in  feature  matching  are  essentially  the  same  as  in  op¬ 
tical  flow  except  that  we  focus  on  those  locations  where  these  problems  are  least  noticeable 
by  sacrificing  the  availability  of  dense  information. 

Despite  the  difficulties  in  computing  optical  flow  or  feature  matches  both  are  very 
intuitive  intermediate  representations  with  an  exactly  quantifiable  relationship  to  the  pa¬ 
rameters  of  rigid  body  motion  and  scene  structure  (assuming  that  the  optical  flow  prefectly 
matches  the  motion  field).  Unfortunately,  however,  this  relationship  is  nonlinear  and  no 
closed-form  solution  for  the  general  case  exists.  Among  the  methods  which  use  optical  flow 
or  feature  matches  are  Bruss  and  Horn  [10],  Adiv  [1],  Tsai  and  Huang  [49],  Longuet- Higgins 
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and  Prazdny  [31],  Ullman  [50],  Heel  [22],  Mathies,  Szelsiki  and  Kanade  [34],  Subbarao  [47], 
Spetsakis  and  Aloimonos  [45],  Broida  and  Chellappa  [8]. 

1.2  Two-frame  structure  and  motion  estimation 

Features  matches  and  optical  flow  are  generally  computed  from  two  sequential  frames  in 
an  image  sequence.  As  a  consequence,  the  information  that  can  be  extracted  from  a  single 
optical  flow  field  is  limited  to  a  snapshot  of  short  duration.  Since  both  image  brightness 
measurement  and  optical  flow  computation  introduce  error,  such  a  single  snapshot  cannot 
provide  very  accurate  information.  In  addition,  entire  sequences  of  images  are  usually 
available,  possibly  in  a  continuous  fashion,  in  the  motion  imaging  situation.  Repeatedly 
applying  a  two-frame  algorithm  to  sequential  pairs  of  images  does  not  exploit  the  noise- 
reduction  potential  available  in  the  form  of  redundant  information. 

Examples  of  two-frame  motion  estimation  are  Bruss  and  Horn  [10],  Adiv  [1],  Tsai  and 
Huang  [49],  Longuet- Higgins  and  Prazdny  [31],  Weng,  Huang  and  Ahuja  [54],  Fennema  and 
Thompson  [13],  Waxman  and  Wohn  [53],  Roach  and  Aggarwal  [43],  Mitiche  [36],  Negah- 
daripour,  Weldon  and  Horn  [40],  [28]. 

1.3  Direct  and  multiframe  methods 

Recently  a  number  of  approaches  have  been  proposed  to  overcome  the  aforementioned 
difficulties.  Some  researchers  have  developed  techniques  that  extract  motion  and  structure 
information  without  computation  of  the  optical  flow.  Best  known  are  the  “direct”  methods 
of  Horn,  Weldon  and  Negahdaripour  [40],  [41],  [28].  The  basis  of  this  work  is  the  “brightness 
change  constraint  equation”  which  links  image  brightness  values  to  motion  and  structure 
parameters.  Despite  this  simplification,  the  resulting  equations  are  still  nonlinear  and  no 
closed-form  solution  exists  in  the  general  cases.  The  authors  show  how  motion  or  structure 
can  be  recovered  in  various  special  cases  such  as  purely  translational  motion  or  planar 
surfaces. 

Aloimonos,  Herve  and  Ito  [2],  [29]  suggested  a  method  for  planar  surfaces  and  developed 
the  concept  of  a  “linear  feature”  for  direct  motion  estimation.  This  work  also  makes  use 
of  the  brightness  change  constraint  equation  mentioned  above.  Kanatani  [30]  uses  “feature 
functionals”  to  determine  8  flow  parameters  from  which  the  motion  and  normal  of  a  planar 
surface  may  be  obtained. 

The  idea  of  using  multiple  frames  in  order  to  reduce  the  influence  of  noise  has  become 
increasingly  popular  recently.  Ullman  [50]  presented  a  method  termed  the  “incremental 
rigidity  scheme”  which  operated  iteratively  on  a  sequence  of  images  to  recover  structure. 
It  required  feature  extraction  and  matching. 

Broida  and  Chellappa  [9]  suggested  the  use  of  a  Kalman  filter  for  iterative  improvement 
of  estimates  for  structure  and  motion  parameters  over  a  sequence  of  images.  They  also  used 
a  sequence  of  matched  features  as  input  for  their  algorithm.  In  addition,  a  model  for  the 
object  in  the  scene  was  necessary  which  contained  information  about  the  relative  location 
of  feature  points. 

Another  recursive  depth  estimation  procedure  for  more  than  two  frames  was  suggested 
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by  Bharwani,  Riseman  and  Hanson  [5].  Using  feature  matches  and  known  translational 
motion  depth  is  then  recovered. 

Stuller  and  Krishnamurthy  [46]  also  used  a  Kalman  filtering  approach  in  which  the 
projected  translational  motion  of  an  object  is  formulated  in  terms  of  a  dynamical  systems. 
Then  a  recursive  estimation  scheme  is  used  to  recover  the  2D  translational  motion. 

Dickmans  [11],  [12]  has  investigated  a  Kalman  filtering  based  motion  analysis  algorithm 
which  he  termed  “4D  dynamic  scene  interpretation”.  The  algorithm  and  a  matching  hard¬ 
ware  architecture  have  been  used  successfully  for  the  guidance  of  vehicles  in  real  traffic 
environments. 

Backed  by  encouraging  experimental  results,  Matthies,  Szeliski  and  Kanade  [35],  [34] 
presented  a  practical  and  simple  method  for  estimating  structure  using  a  Kalman  filter 
based  algorithm  over  a  sequence  of  images.  They  succeeded  in  recovering  dense  depth 
maps  of  unprecedented  quality  from  real  images  under  the  assumption  of  known  purely 
translational  motion  parallel  to  the  image  plane. 

Heel  [21],  [22],  [23]  independently  developed  a  Kalman  filter  based  method  for  the 
dense  estimation  of  structure  similar  to  the  one  by  Matthies,  Szeliski  and  Kanade.  This 
“dynamic  motion  vision”  had  the  additional  capability  of  estimating  the  parameters  of  rigid 
body  motion  rather  than  requiring  them  to  be  known.  It  also  placed  no  restriction  on  the 
nature  of  the  motion. 

A  prominently  different  approach  was  the  “epipolar  image”  method  by  Bolles  and  Baker 
[7]  (also  Yamamoto  [56]).  This  method  is  particularly  interesting  because  it  does  not 
require  optical  flow  computation  or  conventional  feature  matching  and  makes  use  of  multiple 
frames.  For  translational  motion  parallel  to  the  image  plane,  a  horizontal  slice  through  a 
sequence  of  images  provides  important  information  not  only  about  rigid  body  motion  and 
structure  but  also  about  occlusion  and  segmentation.  More  recently  [4]  Baker  and  Bolles 
have  generalized  this  method  to  handle  more  complex  motions  and  to  work  incrementally 
as  new  images  become  available.  In  this  case,  the  method  becomes  similar  to  previously 
mentioned  Kalman  filtering  methods. 

1.4  Direct  dynamic  motion  vision 

In  this  paper  we  present  an  algorithm  for  the  dense  estimation  of  structure  and  motion 
that 

•  does  not  require  the  computation  of  optical  flow  or  feature  matches 

•  uses  an  image  sequence  of  arbitrary  length  to  reduce  the  effect  of  noise. 

In  doing  so,  this  method  overcomes  the  obstacles  that  previous  approaches  encountered  for 
estimating  structure  and  motion  efficiently  and  robustly. 

We  build  on  the  “direct”  work  of  Horn,  Weldon  and  Negahdaripour  to  avoid  computa¬ 
tion  of  the  optical  flow.  We  combine  it  with  the  Kalman  filter  recursive  estimation  technique 
to  continuously  update  and  improve  structure  information  over  time,  as  new  measurements 
from  images  become  available.  A  least-squares  motion  estimation  procedure  is  interleaved 
with  the  stages  of  the  filter,  so  that  both  motion  and  structure  are  obtained  simultaneously, 
which  was  not  previously  possible  in  the  case  of  general  surfaces  and  motions. 
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The  paper  is  organized  as  follows:  We  briefly  review  the  direct  approach  to  motion 
vision  and  the  theory  of  recursive  estimation  as  they  apply  to  our  application.  We  then  for¬ 
mulate  the  dynamic  motion  and  structure  estimation  algorithm  by  combining  direct  depth 
measurement  with  Kalman  Altering.  After  the  outline  of  the  appioach,  each  stage  of  the  al¬ 
gorithm,  depth  measurement,  update,  prediction,  motion  estimation  and  smoothing/fllling 
in  are  discussed  in  detail,  analyzed  with  respect  to  their  algorithmic  complexity  and  eval¬ 
uated  experimentally.  Finally,  the  results  which  may  be  obtained  using  this  method  are 
evaluated  in  an  extensive  experimental  section  using  a  variety  of  real  image  sequences. 
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2  The  Direct  Motion  Vision  Approach 

2.1  The  motion  imaging  situation 

The  traditional  approach  to  motion  vision  as  outlined  by  Longuet-Higgins  and  Prazdny 
[31]  was  to  compute  the  optical  flow  as  an  intermediate  representation  of  motion  information 
and  then  compute  rigid  body  motion  and  structure  parameters  using  the  optical  flow. 
Optical  flow  denotes  a  computational  approximation  to  the  projection  of  the  3D  velocity 
field  of  a  scene  into  the  image  plane. 


© 


Figure  1:  Coordinate  system  and  projection  geometry  in  the  motion  imaging  situation. 

In  what  follows  we  assume  that  only  one  independently  moving  rigid  body  is  in  the 
image  area  of  interest.  This  is  the  case  when  a  camera  moves  through  a  static  environment 
for  example.  If  multiple  moving  objects  are  present  a  segmentation  into  regions  of  single 
independently  moving  objects  must  be  performed  prior  to  applying  this  formalism. 

Suppose  we  denote  the  relative  motion  of  the  camera  with  respect  to  the  scene  by  a 
translation  4  =  [t/,  V,  W]r  and  a  rotation  =  [A,2?,C]r.  Choose  a  coordinate  system  as 
shown  in  figure  1  with  the  origin  at  the  camera’s  center  of  projection  (principal  distance 
f)  and  the  optical  axis  aligned  with  the  z  axis  ft  =  [0, 0,  l]r.  A  point  P  at  location  R  = 
[X,Y,Z]T  in  the  scene  is  imaged  at  pixel  location  r  =  [z,y,  f]T.  The  value  Z  is  referred  to 
as  the  depth  of  point  P. 

Given  this  situation,  the  point  P  will  move  according  to 

4  =  -t-wxR  (1) 
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where  the  dot  denotes  temporal  differentiation.  We  can  think  of  the  vector  &  as  a  3D 
motion  vector  that  is  attached  to  a  point  on  the  surface  of  the  scene  in  view.  The  collection 
of  these  vectors  for  all  surface  points  constitutes  a  3D  velocity  vector  field.  Suppose  this 
velocity  field  is  now  projected  into  the  image  plane  via  the  equation 


of  perspective  projection.  We  differentiate  equation  (2)  and  then  substitute  (1)  to  obtain 

*  =  -i  x  (r  x  (yr  x  u>  -  it)  (3) 

The  vector  f  describes  the  motion  of  the  projection  of  P  into  the  image  plane.  The  collection 
of  r  vectors  at  every  point  in  the  image  plane  is  a  vector  field  of  image  velocities  commonly 
referred  to  as  the  motion  field.  This  is  the  motion  information  that  an  observer  could 
perceive  given  a  perfect  motion  “measurement”  device. 

2.2  Motion  and  structure  from  optical  flow 

We  investigate  the  motion  field  equation  (3)  in  more  detail.  This  vector  equation  for 
r  consists  of  three  components,  the  third  of  which  is  identical  to  zero,  since  the  third 
component  of  r  is  the  constant  focal  length  /.  The  remaining  two  components  of  the  image 
velocity  (i,y)  =  (u,c)  can  be  written  as 

«  =  ~fU  +  XW  +  j(Axy  -  B(z*  +  /»)  +  Cyf)  (4) 

*  =  ~/V  +  ^  +  j{A(y3  +  /*)  -  Bxy  -  Cxf).  (5) 

These  motion  field  equations  can  be  established  at  every  pixel  (z,y)  in  the  image  plane.  The 
classical  assumption  is  that  estimates  of  the  image  velocities  (u,  v)  can  somehow  be  obtained 
from  an  input  image  sequence.  The  task  is  then  to  compute  the  motion  t  =  [C7,  V,  W]r, 
u>  =  [A,B,C]t  as  well  as  the  depth  Z  given  the  velocities  (u,v). 

Note  that  the  motion  t  and  u>  is  a  global  quantity,  i.e.  there  is  one  rigid  body  motion  for 
all  points  in  the  image  plane.  The  depth  Z  however,  may  vary  spatially.  In  an  image  region 
of  n  x  n  pixels  we  therefore  have  2(n  x  n)  motion  field  equations  for  n  x  n  +  6  unknowns.  We 
also  observe  that  translation  and  depth  are  determined  only  up  to  a  common  scale  factor 
i.e.  if  t  and  Z  are  a  solution  to  the  motion  field  equations  then  so  are  Jbt  and  kZ,  for  any 
k  ^  0.  We  can  therefore  at  best  hope  to  recover  the  direction  of  translation  and  the  depth 
relative  to  that  translation.  Finally,  the  fact  that  the  unknowns  Z,  t  and  u>  are  related  in 
a  non-linear  fashion  in  equations  (4),  (5)  makes  the  motion  and  structure  estimation  from 
optical  flow  hard  and,  to  date,  unsolvable  in  closed-form. 

2.S  Optical  flow 

Estimating  the  optical  flow  (u,  v)  from  images  is  the  other  part  of  the  traditional  motion 
vision  problem.  We  wish  to  determine  a  vector  field  (u,  t>)  (i.e.  a  vector  at  every  point 
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(x,y)  in  the  image)  that  approximates  the  motion-field  (4),  (5)  as  closely  as  possible.  To 
accomplish  this  we  note  that  a  particular  optical  flow  vector  points  from  the  projection  of 
a  point  in  one  image  to  the  projection  of  the  point  in  the  subsequent  image.  Conceptually, 
an  optical  flow  vector  therefore  connects  corresponding  points  in  sequential  images. 

It  is  generally  assumed  that  the  brightness  of  corresponding  points  in  two  subsequent 
images  is  the  same.  Horn  and  Schunck  [26]  formulated  this  in 


(6) 


where  E  denotes  the  brightness  of  a  point  in  the  image  and  t  is  the  time.  This  can  be 
expanded  into  the  brightness  change  constraint  equation 


dE  dx  dE  dy  dE 
dx  dt  dy  dt  dt 


=  Exu  +  Eyv  +  Et  =  0 


(7) 


where  Ex,  Ey,  Et  are  the  brightness  derivatives  in  spatial  and  temporal  directions  and  (u,  v ) 
is  the  optical  flow  introduced  above.  The  brightness  gradient  may  be  obtained  by  a  finite 
differences  approximation  for  the  derivatives  directly  from  the  images. 

The  brightness  change  constraint  equation  links  the  brightness  values  E  measured  by  the 
camera  system  to  the  optical  flow  values  (u,  v)  needed  for  motion  and  structure  estimation. 
We  see  that  with  one  brightness  change  constraint  equation  at  every  image  point  (x,  y)  and 
two  unknowns  (u,  r)  we  are  dealing  with  an  underconstrained  problem.  We  must  therefore 
introduce  an  additional  constraint  such  as  the  smoothness  of  the  optical  flow  field  (as  done 
by  Horn  and  Schunck  [26]  and  Hildreth  [25])  in  order  to  solve  equation  (7)  for  the  desired 
(u,v)  everywhere. 

Nevertheless,  optical  flow  computation  is  both  expensive  in  terms  of  computation  time 
and  subject  to  both  systematic  error,  such  as  the  aperture  problem,  and  random  errors  due 
to  image  noise.  The  necessity  to  compute  the  optical  flow  has  been  a  major  obstacle  on 
the  path  to  real-time  motion  vision. 

We  might  add  that  criticism  of  the  validity  of  the  brightness  change  constraint  assump¬ 
tion  (7)  has  been  voiced  (see  Verri  and  Poggio  [51],  Schott  [44]).  Indeed,  equation  (7)  is 
only  a  simplified  approximation  of  reality  which  is  subject  to  errors  in  most  real  situations. 
The  main  advantage  lies  in  its  simplicity.  We  will  later  evaluate  the  amount  of  error  in  the 
brightness  change  constraint  equation  for  our  specific  application. 


2.4  Skipping  optical  flow:  The  direct  approach 

As  we  have  outlined,  the  estimation  of  scene  structure  and  relative  camera  motion 
was  traditionally  a  two-stage  process  consisting  of  the  computation  of  optical  flow  from 
brightness  values,  followed  by  the  recovery  of  motion  and  structure  via  the  motion  field 
equations.  In  this  approach,  the  optical  flow  is  no  more  than  an  intermediate  representation, 
the  computation  of  which  involves  a  number  of  difficulties. 

Horn,  Negahdaripour  and  Weldon  [38],  [40],  [27],  [28],  [39],  [42]  were  among  the  first 
to  point  out  that  the  computation  of  optical  flow  was  not  a  necessary  step  in  the  process 
of  determining  structure  and  motion  from  image  sequences.  The  main  advantage  of  this 
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approach  are  tremendous  improvements  in  computational  efficiency  since  the  computation 
of  optical  flow  is  expensive  as  we  have  pointed  out. 

Conceptually,  the  idea  is  as  follows:  The  brightness  change  constraint  equation  (7)  links 
brightness  values  to  optical  flow.  The  motion  field  equations  (4),  (5)  link  optical  flow  to 
rigid  body  motion  and  structure.  Instead  of  computing  optical  flow  from  (7)  and  then  using 
it  in  (4),  (5)  to  obtain  motion  and  structure  we  plug  the  motion  field  equations  (4),  (5)  into 
the  brightness  change  constraint  equation  (7)  and  obtain  one  equation  which  links  image 
brightness  gradients  to  motion  and  structure  parameters. 

We  can  rewrite  and  simplify  the  resulting  equation.  For  a  detailed  derivation  refer  to 
Horn  and  Weldon  [28].  We  obtain 

s  ■  t 

—=r  +  V  -w  +  Et  =  0  (8) 

in  which 

-fEx  Eyf  +  y(xEx  +  yEy)/f 

s=  -fEy  and  v=  -Exf  -  x(xEx  +  yEy)/ f  .  (9) 

xEx  +  yEy  yEx  -  xEy 

Note  in  (9)  that  s  and  v  contain  only  measureable  or  known  quantities  such  as  the 
derivatives  of  brightness  Ex,Ey,Et  and  the  location  (x,y)  at  which  we  are  evaluating  the 
constraint.  The  unknowns  in  equation  (8)  axe  therefore  the  motion  t  and  u>  and  the  depth 
Z  -  precisely  the  quantities  we  are  interested  in. 

In  this  formulation,  the  task  is  to  compute  n  x  n  values  of  Z  and  6  values  for  t  and 
u>  (actually  only  5  due  to  the  scale-factor  ambiguity)  from  n  x  n  nonlinear  equations.  In 
their  work  Horn,  Negahdaripour  and  Weldon  showed  how  motion  and  structure  parameters 
may  be  estimated  using  equation  (8)  for  a  number  of  special  cases  (pure  translation,  planar 
surfaces  etc.).  A  major  difficulty  they  encountered  was  the  fact  that  motion  and  depth  are 
linked  nonlinearly  in  (8)  and  cannot  both  be  recovered  in  general.  Note,  however,  that  if 
one  of  either  motion  or  depth  is  known,  the  other  quantity  is  easily  obtained  from  linear 
equations.  This  observation  is  the  key  to  the  iterative  estimation  of  both  structure  and 
motion  from  a  sequence  of  frames. 
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3  Recursive  Estimation  and  Dynamic  Motion  Vision 


The  structure  and  motion  estimation  problem  as  we  have  previously  outlined  it  is  in¬ 
herently  instantaneous.  Two  subsequent  frames  are  commonly  used  to  compute  the  optical 
flow  so  that  any  structure  or  motion  result  obtained  from  this  optical  flow  is  based  on  infor¬ 
mation  from  those  two  time  instances  only.  In  the  direct  approach  we  use  two  subsequent 
frames  to  compute  the  brightness  gradients  Ex ,  Ey  and  Et.  Structure  and  motion  recovered 
from  a  direct  method  corresponds  to  the  information  from  those  two  frames  only. 

As  we  will  see,  the  motion  and  in  particular  the  structure  information  obtainable  from  an 
instantaneous  snapshot  is  rather  noisy  in  general.  Conceptually  information  from  multiple 
frames  may  be  useful  to  reduce  this  noise  and  produce  a  useful  estimate  over  time.  There  are 
two  fundamentally  different  ways  in  which  multiple  frame  measurements  can  be  integrated. 

1.  Global  or  batch  methods.  In  this  case,  the  data  from  n  frames  is  collected  and  then 
input  to  the  integration  procedure.  We  expect  this  to  produce  the  most  accurate 
result  for  the  given  set  of  frames.  However,  it  necessitates  storage  of  large  amounts 
of  data  and  does  not  produce  any  result  until  all  frames  have  been  acquired. 

2.  Recursive  or  iterative  methods.  Here  we  maintain  a  current  estimate  of  the  quantities 
of  interest  which  are  updated  each  time  a  new  frame  becomes  available.  The  accuracy 
of  such  a  method  should  be  inferior  to  a  global  technique  and  convergence/stability 
becomes  an  issue.  However,  no  storage  is  required  and  an  estimate  is  available  at 

every  time  instant. 

* 

Due  to  the  continuous  nature  of  visual  information  acquisition  and  the  large  amounts  of 
data  involved  we  focus  here  on  a  recursive  estimation  method  for  temporal  integration. 

3.1  Recursive  Estimation  Theory 

There  are  many  in-depth  treatments  of  recursive  estimation  theory  such  as  Gelb  [17]. 
Knowledge  of  dynamical  systems  theory  is  also  useful  and  provided  by  standard  textbooks 
such  as  Luenberger  [32],  Franklin  and  Powell  [15],  Follinger  [14],  Willsky  [55]  and  many 
others.  We  present  here  an  extremely  simplified  summary  of  essential  concepts  in  recursive 
estimation  theory  which  we  will  use  for  time-continuous  motion  vision.  We  will  use  notation 
to  match  the  motion  vision  problem  to  make  the  applicability  of  this  theory  to  our  problem 
domain  apparent. 

Suppose  our  task  is  to  estimate  a  (scalar)  quantity  Z(k )  from  a  sequence  of  measure¬ 
ments  Zk  taken  at  discrete  points  k  in  time.  Suppose  further  that  the  Zk  are  generated  by 
a  stochastic  process 

Zk  =  Z(k)  +  nk  (10) 

where  nk  is  zero  mean  Gaussian  noise  of  known  variance  pk.  Finally,  we  know  that  the 
quantity  Z  changes  over  time  according  to  the  difference  equation 


Zk+ 1  =  f{Zk). 


(11) 
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The  goal  is  to  compute  at  every  time  instance  k  an  estimate  Zk  which  is  as  close  as 
possible  to  the  true  value  Z(k).  Kalman  formulated  and  solved  this  problem  in  a  far  more 
general  case  (see  the  literature  cited  above)  and  the  resulting  algorithm  is  referred  to  as  a 
Kalman  filter. 

The  Kalman  filter  maintains  an  estimate  Zk  and  a  variance  p*.  When  a  new  measure¬ 
ment  Zk  with  variance  pk  becomes  available  we  perform  the  following  update  operation  (the 
arrow  denotes  an  assignment): 


Zk/pk  +  Zk/pk 

1/Pfe  +  1/Pfc 
1 

1/p*  +  1/p* 


(12) 

(13) 


In  words:  the  new  estimate  of  Z  is  a  weighted  sum  of  the  old  estimate  and  the  new 
measurement  with  the  variances  as  weights.  If,  for  example,  a  very  noisy  measurement 
is  received,  its  variance  p  will  be  large  compared  to  the  variance  of  the  old  estimate  p. 
As  a  consequence  the  measurement  will  have  very  little  influence  on  the  new  value  of  the 
estimate.  Conversely,  a  high-quality  measurement  will  dominate  the  weighted  sum  due  to 
its  small  variance  and  cause  the  new  estimate  to  be  very  similar  to  its  value. 

The  update  formulas  (12)  and  (13)  may  also  be  obtained  from  a  maximum  likelihood 
estimation  of  Z  and  its  variance  given  the  previous  estimate  and  the  measurement  under 
the  assumption  that  the  estimate  Zk  and  measurement  Zk  are  uncorrelated. 

Notice  the  following  feature  of  the  variance  update  process  (13):  Suppose  without  loss 
of  generality  that  pk  <  Pk-  Now  (since  variances  are  positive) 

VnTwi.  <Wk=pk<Pi  (14) 

so  the  updated  variance  is  smaller  than  both  the  current  variance  and  the  measurement 
variance.  In  other  words,  variance  decreases  strictly  during  the  update  procedure.  This 
captures  the  notion  that  the  quality  of  the  estimate  improves  over  time  and  indicates  that 
later  measurements  will  have  less  and  less  influence  on  the  resulting  estimate. 

An  additional  difficulty  is  introduced  by  the  fact  that  Z{k)  may  vary  with  time.  It  is 
therefore  not  possible  to  simply  combine  the  result  Zk  obtained  through  an  update  at  time 
k  with  a  measurement  obtained  at  time  k  -t-  1.  We  must  first  predict  an  estimate  for  time 
k  +  1  using  the  known  temporal  behavior: 

Zk+ 1  -  f(Zk)  (15) 

*‘+i  -  (iz) pk-  (16) 

The  original  Kalman  filter  and  the  proofs  for  optimality  require  the  function  /  to  be  linear. 
We  will  see  that  /  is  indeed  linear  in  our  application. 

We  can  visualize  the  operation  of  the  Kalman  filter  with  the  help  of  a  block  diagram 
as  shown  in  figure  2.  Note  the  simplicity  and  locality  of  both  update  and  prediction  stages 
which  make  this  estimation  particularly  appealing  from  an  algorithmic  point  of  view. 
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Figure  2:  Block  diagram  of  a  Kalman  filter. 


3.2  Dynamic  Motion  Vision 

In  this  section  we  will  outline  how  the  theory  of  recursive  estimation  can  be  combined 
with  the  direct  approach  to  motion  vision.  The  result  is  a  time-continuous  estimation 
procedure  which  not  only  extends  the  one-frame  case  to  multiple  frames  but  allows  us  to 
recover  both  motion  and  structure  by  exploiting  the  iterative  nature  of  the  Kalman  filter, 
something  that  was  not  possible  previously.  We  refer  to  this  technique  as  “dynamic  motion 
vision”. 

Suppose  for  now  the  rigid  body  mption  vectors  t  and  u?  are  known.  In  this  case,  the 
brightness  change  constraint  equation  (8)  contains  only  one  unknown  and  can  be  solved  for 
the  depth  Z.  By  evaluating  it,  we  can  therefore  obtain  a  value  of  Z  at  every  point  (x,y)  in 
the  image  and  at  every  time  instance  k.  Due  to  noise  and  errors  in  the  brightness  change 
constraint  assumptions,  this  value  will  differ  from  the  true  value  of  Z  of  the  point  which 
is  imaged  at  that  pixel.  Further,  the  value  of  Z  at  a  particular  pixel  changes  over  time 
due  to  the  relative  motion  between  object  and  observer.  Using  the  equations  of  rigid  body 
motion,  this  change  in  depth  can  be  derived  precisely. 

The  recursive  estimation  task  can  therefore  be  formulated  as  follows.  We  wish  to  es¬ 
timate  the  depth  Z  at  a  particular  pixel  ( x,y )  in  the  image  plane.  For  each  new  image 
we  can  compute  a  noisy  measurement  Z*  using  the  brightness  change  constraint  equation. 
This  measurement  can  be  used  to  update  an  estimate  Z*  according  to  the  Kalman  filter 
update  rule.  As  the  next  frame  k  +  1  becomes  available,  the  depth  value  changes  due  to  the 
rigid  body  motion:  Z*+ 1  =  /(Z*).  This  known  transformation  can  then  be  used  to  predict 
the  estimate  for  the  depth  value  Zjt+i  in  the  next  iteration.  A  block  diagram  depicting  this 
operation  is  shown  in  figure  3. 

Note  that  the  previous  formulation  describes  the  recursive  estimation  of  Z  at  one  par¬ 
ticular  pixel  location  (z,y).  In  general  we  will  be  interested  in  obtaining  depth  information 
for  every  pixel  in  the  image.  To  accomplish  this  we  can  imagine  a  Kalman  filter  positioned 
at  every  pixel  location  where  it  continuously  estimates  the  depth.  During  the  prediction 
stage,  some  interaction  will  take  place  between  these  spatially  distributed  Kalman  filters 
when  the  projections  of  real  world  points  of  which  depth  is  being  estimated  move  to  new 
image  locations  due  to  rigid  body  motion. 
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Figure  3:  Block  diagram  of  Kalman  filter  in  motion  vision. 

So  far,  we  have  assumed  that  the  motion  parameters  t  and  u>  are  known.  This  may 
be  the  case  in  a  number  of  practical  situations,  in  particular  when  the  camera  is  mounted 
on  vehicle  or  robot.  In  this  case,  the  method  outlined  above  is  a  complete  solution  to 
the  recursive  structure  estimation  problem.  The  even  more  interesting  problem,  however, 
arises,  when  the  motion  is  also  unknown  and  must  be  estimated.  The  key  idea  is  the 
following.  When  depth  Z- is  known  at  every  location  (*,y),  we  can  solve  the  brightness 
change  constraint  equation  (8)  for  the  motion  vectors  t  and  u>.  However,  depth  is  unknown 
•  it  is  precisely  what  the  filter  is  estimating.  Now,  in  each  iteration  of  the  filter  we  have  a 
current  estimate  of  the  depth  Z.  The  idea  is  then  to  use  the  current  filter  estimate  Z  for 
the  computation  of  motion  from  the  brightness  gradient  using  equation  (8).  This  process 
can  be  initialized  by  starting  with  an  arbitrary  constant  value  (since  the  motion  estimation 
adjusts  for  the  unknown  scale  factor)  for  Z  everywhere. 

This  procedure  of  alternatively  estimating  depth  and  motion  relies  on  the  fact  that 
motion  estimation  is  rather  insensitive  to  errors  in  depth  as  we  show  in  our  experiments. 
Conceptually,  we  can  think  of  the  motion  vectors  as  parameters  for  our  filter.  Inserting  the 
motion  estimation  between  update  and  prediction  stage  of  the  filter  therefore  corresponds 
to  an  online  adaptation  of  the  filter  paramters.  Such  an  adaptive  filter  is  of  course  not  in 
line  with  the  classical  Kalman  filter  theory  and  theoretical  properties  derived  for  the  latter 
do  not  extend  in  a  straightforward  way. 

Figure  4  shows  the  integration  of  motion  estimation  into  the  dynamic  motion  vision 
process.  This  block  diagram  represents  the  structure  which  we  have  implemented  and  the 
components  of  which  we  will  describe  in  detail  in  the  following  sections. 

For  the  sake  of  clarity  we  have  omitted  another  block  which  should  be  positioned  be¬ 
tween  update  and  prediction  part  of  the  filter.  Our  implementation  allows  us  to  perform  a 
variety  of  spatial  noise  reduction  operations.  This  was  motivated  by  two  facts: 

1.  The  filter  estimation  process  does  not  take  spatial  smoothness  into  account:  each  filter 
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Figure  4:  Block  diagram  of  dynamic  motion  vision  with  motion  estimation. 

at  each  pixel  operates  independently.  A  spatial  smoothing  step  may  be  desirable  to 
enforce  a  certain  amount  of  smoothness. 

J 

2.  Most  any  image  contains  regions  of  uniform  brightness  in  which  no  depth  information 
is  obtainable  from  motion.  Mathematically  this  is  seen  by  the  fact  that  all  components 
of  the  brightness  gradient  are  zero,  or  very  small,  in  such  a  region  thereby  providing 
little  or  no  constraint  on  the  value  of  the  depth.  Even  repeated  measurements  will 
provide  poor  data  with  large  variances  in  such  regions.  In  such  cases  it  may  be 
desirable  to  fill  in  data  from  adjacent  regions  of  lower  variance. 

Filling  in  and  smoothing  are  simply  cosmetic  operations  which  do  not  contribute  to  the 
underlying  idea  of  temporal  noise  reduction  via  recursive  estimation.  They  are  therefore 
purely  optional  in  our  implementation. 

The  following  sections  contain  detailed  descriptions  of  the  individual  modules  used  in 
the  dynamic  motion  vision  process. 
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4  Direct  Depth  Estimation 

The  first  stage  of  the  dynamic  motion  and  structure  estimation  algorithm  is  the  com¬ 
putation  of  depth  from  the  input  image  data.  As  shown  in  the  structural  block  diagram 
in  figure  4,  the  input  to  the  depth  estimation  stage  are  the  brightness  derivatives  Ex,  Ey 
and  Et  and  the  motion  t  and  w.  The  output  is  the  depth  Z  and  its  variance  at  every  pixel 

(x,y)- 

In  this  section,  we  describe  how  we  compute  the  gradient  of  image  brightness.  We  then 
show  how  depth  may  be  obtained  in  a  straightforward  way  from  brightness  gradients  and 
motion.  Using  data  from  real  images  we  will  show  that  this  straightforward  solution  is 
quite  unsatisfactory.  We  then  suggest  a  way  to  improve  the  solution  method  to  produce 
better  results  and  illustrate  the  achievable  improvement  with  more  image  data. 

4.1  Brightness  gradient  computation 

We  are  given  discrete  samples  of  the  brightness  function  E(x,  y,  t).  The  objective 
is  to  estimate  the  first  derivatives  of  E  with  respect  to  all  of  its  independent  variables.  The 
simplest  approximation  is  obtained  using  central  differences 


•£*(*'*  j*k)  —  {Ei+ij'ic  —  £j_j  ji*)/2Ax 

(17) 

Ey(i,j,k)  =  (Eij+v,  -  Eij- i,*)/2Ay 

(18) 

=  (•E.j.fc+i  -  £’«,j,ife-i)/2At, 

(19) 

where  Ax  and  A y  denote  the  physical  spacing  of  pixels  on  the  CCD  chip  in  horizontal 
and  vertical  directions  and  At  denotes  the  time  elapsed  between  the  acquistion  of  two 
subsequent  images. 

Horn  and  Schunck  [26]  suggested  an  alternative  in  which  a  derivative  value  is  not  com¬ 
puted  from  a  difference  across  a  single  pixel  but  as  the  average  of  four  differences  at  neigh¬ 
boring  pixels: 


Ex(i , 

Jyk) 

-  (Ei+i,j,k  - 

Ei,j,k  +  Ei+i,j+i'k 

-  Eij+i,k  + 

(20) 

Ei+i'j'k+i 

-  Ei,j,k+ 1  +  £.+i,j+i,fc+i  -  -Ei,j+itJc+i)/4Ax 

Ey(h 

-  (£i,j+i,fc  - 

Ei,j,k  4-  •Et+i.j+i,* 

-  Ei+\j'k  + 

(21) 

Ei,j+l,k+ 1 

~  Ei,j,k+ 1  +  Ei+l<j+lMl  -  •Et+ijI*+i)/4Ay 

Et(i , 

J.*) 

=  {Ei,j,k+ 1  “ 

Ei,j,k  +  Ei+i,j'k+i 

-  Ei+ijtk  + 

(22) 

Ei,j+l,k+l 

-  Ei,j+i,k  +  Ei+i'j, 

*+i  -  •^i+ij+i,*)/2At 

Notice  that  the  derivative  approximations  are  computed  from  pixel  values  at  the  corners  of 
corresponding  grid  squares  in  subsequent  images  as  shown  in  figure  5. 

The  pixels  form  a  spatio-temporal  cube.  The  derivative  values  computed  from  equations 
(20),  (21),  (22)  are  therefore  the  estimates  of  derivative  at  the  center  of  this  cube  -  inbet  ween 
frames  and  inbetween  pixels.  Since  all  computations  in  the  direct  dynamic  motion  vision 
algorithm  are  carried  out  on  gradients  (the  images  can  be  discarded  once  the  gradients 
have  been  computed),  the  relative  position  of  pixels  and  gradient  values  poses  no  difficulty 
during  the  computation.  Note  however,  that  the  gradient  field  will  have  one  row  and  one 
column  less  than  the  images  used  for  its  computation. 
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Image  k+1 
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Figure  5:  Location  of  pixel  values  for  gradient  computation 


4.2  Single  pixel  depth  computation 


We  have  shown  how  the  brightness  gradient  Em,  Ev,  Et  may  be  computed  and  we  are 
assuming  that  the  motion  vectors  t  and  u  are  known.  To  compute  the  depth  we  use  the 
brightness  change  constraint  equation  (8) 


a  *  t 

— — -  +  v  •  w  +  Et  =  0 

z 

which  we  can  solve  for  the  desired  depth  Z 


Z=  - 


i’t 

Et  +  v  •  u> 


(23) 


(24) 


To  obtain  an  estimate  of  the  variance  in  the  value  of  Z  we  assume  that  the  brightness  E 
at  every  pixel  is  corrupted  by  noise  n  of  variance  <fg  which  is  identically  distributed  at  every 
pixel  and  mutually  uncorrelated  between  pixels.  The  noise  in  the  eight  brightness  values 
used  in  the  computation  of  the  brightness  derivatives  propagates  through  the  derivatives  and 
appears  in  the  depth  value.  Unfortunately,  the  relationship  between  depth  and  brightness 
values  is  nonlinear  and  rather  complex.  We  will  outline  briefly  how  a  formula  for  the 
variance  p—o^  may  be  obtained. 

As  mentioned,  eight  brightness  values  Ei,  *  =  1, ...  8  at  the  corners  of  a  spatio-temporal 
cube  contribute  to  the  value  of  Z: 


Z  =  /(£?!,...,£») 


(25) 


Under  the  assumption  that  the  nonlinear  function  /  can  be  locally  approximated  by  the 
first-order  terms  of  its  Taylor  series,  the  variance  in  Z  is  given  by 


(26) 
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To  compute  the  necessary  partial  derivatives  we  first  determine  the  immediate  relationship 
between  Z  and  the  brightness  derivatives  by  plugging  in  the  values  for  s  and  v  from  equation 
(9)  into  the  depth  equation  (24).  We  find 


in  which 


r,  _  +  bEv 

Et  +  cEm  -(■  dEy 

(27) 

a  = 

fU-xW 

(28) 

b  = 

fV-yW 

(29) 

c  = 

( xyA-(f3  +  x3)B)/f  +  yC 

(30) 

d  = 

((/*  +  -  xyB)/f  -  xC 

(31) 

(Recall  that  /  is  the  principal  distance,  t  =  [U,  V ,  W]T  and  u>  =  [4,  B,  C]r).  Next,  we 
plug  in  the  formulas  for  the  partial  derivatives  Em  (20),  Ev  (21),  Et  (22)  which  would 
require  more  space  than  we  have  available  here.  However,  after  differentiating  the  result 
with  respect  to  the  eight  brightness  values  according  to  (26),  some  simplification  can  be 


done  and  we  obtain 

P  = 

a 

'  i 

(— )3( 

V 

— “ -)J  +  (“— )a(~“)J  +  (— ~)af  \*\ 
dEm}  +(±y)(dEy)  ) 

(32) 

where 

ez 

be. 

aEt  +  (ad  -  bc)Ev 
(Et  +  c  Em  +  dEy)2 

(33) 

ez 

bEt  —  (ad  -  be) Ex 

(34) 

dEy 

(Et  +  cEm+dEy)2 

9Z 

aEm  +  bEy 

(35) 

dEy 

(Et  +  cEm  +  dEy)2' 

In  conclusion,  the  derivation  is  extremely  complex  and  the  resulting  formula  is  expensive 
to  compute.  Moreover,  the  result  is  only  valid  approximately  due  to  the  Taylor  series.  It 
would  be  extremely  useful  to  find  a  simpler  approximation  that  serves  the  same  purpose: 
quantifying  the  quality  of  the  measurement  Z. 

We  will  now  briefly  investigate  the  performance  of  the  depth  estimation  scheme  out¬ 
lined  above.  This  is  necessary,  since  we  are  relying  on  the  brightness  change  constraint 
assumption.  We  investigated  two  scenes  in  which  the  camera  translated  before  a  planar 
surface  (a  poster)  that  was  parallel  to  the  image  plane.  Images  of  the  scenes  are  shown  in 
figures  6  a)  and  b).  We  will  refer  to  them  as  “wave”  and  “onnanoko”. 

Planar  scenes  are  particularly  useful  for  the  evaluation  we  intend  to  perform  since  we 
can  use  the  mean,  variance  and  histogram  of  all  the  computed  depth  values  to  present 
results  compactly.  In  the  case  of  the  wave  image  a  motion  of  t  =  [1.5, 0, 3]  mm  was  used, 
the  plane  was  at  a  distance  of  700  mm  initially.  The  depth  was  then  computed  using  the 
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Figure  6:  The  planar  scenes:  a)  “wave”  b)  “onnanoko”. 
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Figure  7:  Statistical  analysis  of  depth  values  computed  for  planar  scenes:  a)  wave  scene  b) 
onnanoko  scene. 

formulas  derived  above  and  the  known  values  for  the  motion.  The  resulting  histogram  is 
shown  in  figure  7  a). 

We  find  that  the  distribution  of  depth  values  is  extremely  broad  with  a  range  of 
[-22100,22652].  The  standard  deviation  of  1532  is  more  than  twice  as  large  as  the  ac¬ 
tual  distance  of  the  plane.  Ideally,  this  distribution  should  have  mean  700  and  variance  0 
(or  at  least  small  with  respect  to  the  mean).  In  addition  we  have  many  negative  values 
which  are  physically  impossible  (depth  must  be  positive).  In  short,  the  depth  obtained  by 


17 


this  method  is  subject  to  extremely  large  errors  and  almost  useless. 

A  similar  observation  can  be  made  for  the  onnanoko  scene.  Here  the  motion  was  t  = 
[1.5,0, —3]  mm  with  the  planar  surface  at  500  mm  initially.  The  results  of  the  statistical 
analysis  of  the  resulting  depth  map  are  shown  in  figure  7.  The  conclusion  is  the  same:  too 
much  error  to  be  useful. 


4.3  Patch  depth  estimation 

Apparently,  the  quality  of  results  obtained  by  evaluating  the  brightness  change  con¬ 
straint  equation  at  individual  pixels  in  real  images  is  limited.  The  above  results  can  be 
used  directly  to  quantify  the  amount  of  error  involved  in  the  brightness  change  constraint 
assumption.  This  error  will  be  significant  in  general. 

From  our  experience  with  motion  estimation  via  the  brightness  change  constraint  equa¬ 
tion,  however,  (see  section  6.2)  we  knew  that  it  could  be  applied  with  great  success  in 
cases  where  data  was  taken  from  an  entire  image  region  and  then  used  to  estimate  a  global 
quantity  in  that  region  in  a  least-squares  fashion.  This  prompted  us  to  investigate  depth 
estimation  from  a  patch  rather  than  a  single  pixel. 

The  idea  is  as  follows.  Pick  a  patch  size  n  (typically  a  few  pixels).  Assume  Z  is  constant 
within  a  given  patch.  This  gives  usnxn  constraint  equations 


s  ■  t 

—  +  V  •  u>  +  Et  -  0 


(36) 


that  all  contain  the  same  Z.  Find  the  value  of  Z  that  minimizes  the  sum  of  squared  errors 

(37) 


*<z)=  E(:r+v-"+£')1- 

x,ytP  * 


in  which  P  is  the  patch  we  have  chosen.  To  solve  the  minimization  problem  we  substitute 
d  =  1/Z  to  make  the  problem  linear 


$'(d)  =  E  (d(s  •  *)  +  v  •  w  +  Et)2. 

x,V*P 


(38) 


We  differentiate  with  respect  to  d 

d* 


~dd=2H  (d(s*t)  +  V-W  +  £<)(*•  t) 


(39) 


r.yeP 


to  find  the  necessary  condition  for  a  minimum 


das  - 


£(*-«  +  £l)(*-t) 

x,y*P _ 

E(»-*)2 

*,V*P 


(40) 
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Through  backsubstitution  we  finally  obtain  the  expression  for  Z 

\2 


Z  =  - 


2  0»*t)2 

x,yeP 


x,ytP 


We  note  that  in  the  trivial  case  of  a  single-pixel  patch  we  have 


Z- - 


s  •  t 


(v  -u>  +  £t)(s  •  t)  u  •<*>  +  £« 


(41) 


(42) 


which  is  the  same  as  the  formula  (24)  that  we  derived  for  the  single  pixel  estimation  when 
s  •  t  ^  0. 

The  next  task  would  be  to  derive  an  expression  for  the  variance  of  Z  in  the  same  way 
demonstrated  in  the  previous  subsection.  Only  now  the  number  of  brightness  values  which 
contribute  to  Z,  and  which  must  therefore  be  used  in  the  derivation  of  the  variance,  is 
2(n  +  l)2  where  n  is  the  width  and  height  of  the  patch.  From  our  experience  in  the  single 
pixel  case  we  know  that  this  will  not  only  require  an  extremely  lengthy  derivation,  it  will 
also  lead  to  a  very  complex  expression  for  the  variance.  Instead,  we  choose  to  use  a  simple 
approximation  to  the  variance  values  which  exhibits  the  same  properties  that  we  require 
for  the  recursive  estimation  scheme. 

Recall  that  the  purpose  of  the  variance  is  to  indicate  the  quality  of  the  measurement  that 
it  accompanies.  In  our  case,  the  measurement  (depth  Z)  is  computed  as  a  quotient.  The 
magnitude  of  the  denominator  is  of  particular  importance.  When  it  is  zero,  the  depth  cannot 
be  obtained  at  all.  When  it  is  small,  the  result  is  very  sensitive  to  errors  in  the  measurements 
that  enter  into  the  numerator.  The  significance  of  the  denominator  is  reflected  in  the  fact 
that  it  appears  squared  in  the  denominator  of  the  expression  for  the  variance  in  the  single 
pixel  case.  In  analogy  to  this  we  chose  the  following  expression  to  approximate  p: 


_ _ n4S _ 

r~  (£  (••<■>  +  £.)<»•  »))3 
x,ytP 


(43) 


where  n  is  once  again  the  width/height  of  the  patch  and  5  is  a  constant  scale  factor.  The 
term  n4  is  used  to  normalize  the  variance  with  respect  to  the  patch  size.  The  scale  factor  can 
be  chosen  arbitrarily.  However,  when  the  depth  estimator  is  used  as  part  of  the  recursive 
estimation  scheme  its  choice  must  be  coordinated  with  the  initial  value  of  the  variance  in 
the  filter.  Their  relationship  determines  the  convergence  property  of  the  filter. 


4.4  Evaluating  the  patch  depth  estimation  procedure 

We  now  compare  the  patch  depth  estimation  with  the  single  pixel  estimation  using  the 
same  images  and  statistical  analysis  as  above.  Figures  8  a)  and  b)  show  the  histograms  of 
the  depth  values  obtained  from  the  wave  sequences  using  patch  sizes  of  4  and  9.  For  better 
comparison  with  the  single  pixel  case,  the  same  histograms  truncated  to  the  range  used  for 
the  single  pixel  histogram  7  are  shown  in  figures  9  a)  and  b). 
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Figure  8:  Statistical  analysis  of  depth  values  computed  for  the  wave  scene  using  patch 
depth  estimation  with  a)  4  pixel  and  b)  9  pixel  patches. 
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Figure  9:  Statistical  analysis  of  depth  values  computed  for  the  wave  scene  using  patch 
depth  estimation  with  a)  4  pixel  and  b)  9  pixel  patches.  Histogram  ranges  truncated  to 
that  of  the  single  pixel  histogram  above. 

We  expect  that  increasing  patchsize  will  decrease  the  variance  and  range  of  the  data 
and  move  the  mean  towards  the  true  depth  value  of  700.  Surprisingly,  we  find  that  for  a 
patch  size  of  4  the  range  of  depth  values  and  the  variance  increase  (figure  8  a)  with  respect 
to  the  single  pixel  case  (figure  7  a).  However,  when  both  histograms  are  viewed  at  the 
same  scale,  it  becomes  apparent  that  this  effect  is  due  to  a  small  number  of  outliers  that 
appear  in  the  patch  estimation  case.  The  distribution  has  in  fact  become  narrower  as  we 
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had  expected.  In  comparing  the  9  pixel  case  8  b),  9  b)  we  see  that  standard  deviation  and 
range  have  been  reduced  by  2  orders  of  magnitude  with  respect  to  the  single  pixel  case  •  a 
significant  improvement. 

The  appearance  of  the  outliers  can  be  explained  theoretically.  All  outliers  in  these 
experiments  were  due  to  nearly  vanishing  denominators  in  the  expression  (41)  for  computing 
depth  from  patches.  Compare  this  with  the  expression  (24)  used  for  depth  from  single  pixel. 
In  patches  of  uniform  brightness  the  z  and  y  derivatives  of  E  which  appear  in  the  vectors  s 
and  v  will  be  zero.  In  the  single  pixel  estimation  case  of  equation  (24)  this  will  not  lead  to 
a  singularity  unless  Et  vanishes.  In  the  patch  case  of  equation  (41),  however,  zero  spatial 
derivatives  cause  a  singularity. 

Experiments  run  on  the  onnanoko  sequence  confirm  these  observations.  Despite  the 
appearance  of  outliers  the  distribution  of  depth  values  narrows  considerably  when  larger 
patch  sizes  are  employed.  Although  the  improvement  over  the  single-pixel  depth  estimation 
is  considerable,  estimating  depth  from  two  frames  by  such  a  procedure  is  rather  unsatisfac¬ 
tory. 

In  practice,  special  measures  should  be  taken  to  eliminate  the  depth  outliers.  A  sim¬ 
ple  procedure  is  as  follows:  depth  must  be  positive  and  is  also  bounded  from  above.  Our 
implementation  therefore  includes  a  simple  procedure  which  succeeds  the  patch  depth  es¬ 
timation  process.  It  searches  for  depth  values  that  exceed  the  bounds  and  replace  them  by 
the  average  of  neighbors  that  conform  to  the  bounds. 

4.5  Evaluating  the  variance  computation  procedure 


Figure  10:  The  pepsi  scene 

To  evaluate  the  performance  of  the  variance  measure  that  we  have  introduced  we  make 


use  of  yet  another  real  image  sequence.  In  this  experiment  shown  in  figure  10,  a  soda  can 
was  positioned  on  a  table  in  front  of  the  camera  at  a  distance  of  570  mm.  The  background 
is  a  plane  parallel  to  the  image  plane  at  a  distance  of  1240  mm.  The  camera  translated 
t  as  [1.5, 0,0]  mm  between  the  two  frames  used  to  compute  the  gradient. 


Figure  11:  a)  Depth  and  b)  logarithm  of  variances  from  the  pepsi  scene  using  patch  estimate 
with  patch  size  9. 

Figures  11a)  and  b)  show  the  depth  and  variance  computed  by  the  patch  estimator  with 
a  patch  size  of  9.  The  brightness  of  a  point  in  these  figures  is  proportional  to  the  magnitude 
of  the  value  at  that  location  where  dark  points  correspond  to  small  values  and  light  points 
correspond  to  large  values.  Due  to  the  outliers  in  the  depth  map  and  the  wide  range  of  the 
variance  values  the  brightness-encoded  presentation  of  this  data  particularly  difficult.  For 
this  reason  the  depth  values  have  therefore  been  truncated  to  a  range  of  [500,  3000].  This 
range  contains  over  97  %  of  all  depth  values.  To  present  the  variances  we  add  1  and  take 
the  natural  logarithm.  Figure  lib)  therefore  illustrates  the  spatial  distribution  of  variances 
qualitatively  and  the  actual  variation  between  the  variances  is  orders  of  magnitudes  larger 
than  they  appear  thus  enhancing  the  effects  noted  below. 

Note  the  large  errors  in  depth  that  appear  on  the  lower  right  of  the  can  and  the  frontal 
portion  of  the  table  the  can  is  standing  on.  In  comparing  with  the  image  we  find  that 
these  regions  are  basically  uniform  in  brightness  so  that  a  zero  brightness  gradient  results. 
We  therefore  expect  the  depth  estimate  in  those  regions  to  be  poor.  As  a  consequence  the 
variances  there  should  be  large.  Note  how  the  brighter  areas  in  the  variance  figure  pick 
out  the  areas  in  which  the  error  in  depth  is  large.  Since  the  distribution  of  variances  is  in 
actuality  orders  of  magnitude  wider  than  the  logarithmic  intensity  map  suggests,  poor  depth 
estimates  are  very  easily  identified  by  their  variance  values.  These  observations  not  only 
support  the  approximation  we  have  made  to  obtain  the  variance  values  but  also  suggests  a 
way  by  which  regions  of  poor  depth  estimates  may  be  “filled  in”  using  surrounding  values 
of  lower  variance. 
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4.6  Algorithmic  aspects  of  patch  depth  estimation 

Finally  a  note  on  the  algorithmic  aspects  of  depth  estimation  using  the  direct  patch 
method.  The  approach  outlined  above  can  be  implemented  in  a  straightforward  way  in 
time  0(n2k 2)  where  n  is  the  width/height  of  the  image  and  k  is  the  width/height  of  the 
patch.  We  simply  compute  the  sums 

S,  =  S(.-t)1 

x.ytP 

S a  =  ]T  (v  •  u>  +  Et)(a  •  t) 

x,ytP 

for  each  of  the  (n  —  k  +  1)  x  (n  -  k  +  1)  possible  patches  where  each  sum  consists  of  k  x  k 
values. 

Using  the  idea  of  running  sums  it  is  possible  to  reduce  the  computation  time  by  a  factor 
of  k 2.  Notice  that  the  patch  used  to  compute  S\  and  52  for  a  given  location  ( i,j )  in  the 
image  differs  only  by  one  column  of  k  elements  from  the  patch  used  in  the  computation  at 
location  (» + 1,  j).  The  idea  is  therefore  to  compute  the  full  kxk  sums  only  once  (in  the  top 
left  corner  of  the  image  for  example)  and  to  compute  neighboring  values  by  simply  adding 
a  new  rightmost  column  and  subtracting  the  leftmost  column.  The  same  idea  can  be  used 
along  the  rows  of  the  image,  so  that  each  pixel  is  only  used  once  in  the  computation  of  all 
sums  and  the  worst-case  complexity  is  0(n2). 

In  practice,  a  circular  buffer  is  used  to  hold  the  values  of  the  sums  in  each  column  and 
row  of  the  current  patch.  Some  overhead  is  associated  with  maintaining  this  buffer  so  and 
errors  will  propagate  through  the  entire  image  in  this  case.  We  therefore  chose  to  apply 
the  running  sum  technique  only  along  each  line  of  the  image  and  recompute  the  kxk  sum 
at  the  beginning  of  each  line. 


(44) 

(45) 


5  The  Update  Stage 


As  we  see  fr  om  the  dynamic  motion  vision  block  diagram  4  the  task  of  the  update 
stage  is  to  take  as  input  a  depth  measurement  Z  and  its  variance  p  and  combine  it  with  a 
current  estimate  of  depth  Z  and  variance  p  to  update  the  estimate. 

5.1  The  update  equations 

In  our  presentation  of  the  recursive  estimation  theory  we  have  introduced  notation  that 
will  allow  us  to  map  the  theoretical  results  directly  onto  the  motion  vision  problem  domain. 
The  measurement  Zk  in  our  case  is  the  depth  at  a  particular  pixel  (x,  y)  in  the  fcth  frame. 
The  variance  p*  is  the  variance  of  the  depth.  In  the  previous  section  we  have  outlined,  how 
both  values  may  be  obtained  directly  from  image  brightness  values. 

It  follows  that  one  filter  will  maintain  an  estimate  Z  and  its  variance  p  at  a  particular 
pixel  (x,  y )  in  the  image.  We  can  think  of  having  one  filter  at  every  pixel  that  estimates 
the  value  there.  The  update  equation  for  each  one  of  these  filters  is  identical  and  extremely 
simple.  We  have 


Zk/pk  +  Zk/Pk 
1/Pk  +  VPfc 
1 

1/Pfc  +  1/Pfc 


(46) 

(47) 


which  is  taken  directly  from  the  simplified  Kalman  filter  equations  (12)  and  (13).  The 
theoretical  properties  of  this  update  process  have  been  discussed  in  detail  in  subsection  3.1. 


5.2  Evaluating  the  update  process 

To  study  the  effect  of  the  update  procedure  we  show  results  obtained  from  the  pepsi 
sequence  introduced  in  the  previous  section  (figure  10).  The  depth  measurement  Z*  and 
its  variance  pk  are  computed  from  the  brightness  gradient  using  the  patch  depth  estimate 
described  in  the  previous  section  with  a  patch  size  of  9  pixels  and  using  the  known  motion 
t  as  [1.5, 0,0]  mm.  The  resulting  depth  map  is  shown  in  figure  12. 

For  the  update  procedure  we  used  a  previous  estimate  Zk  corresponding  to  a  frontal 
plane  at  1000  mm  distance.  Since  initially  we  have  no  information  about  the  shape  of  the 
scene  a  flat  depth  map  is  a  plausible  way  to  start.  The  choice  of  initial  variances  pk  should 
be  large  so  as  to  indicate  the  complete  uncertainty  about  the  current  estimate.  Since  the 
square  root  of  the  variance  is  the  standard  deviation  of  the  error  we  used  a  variance  of 
p  =  10002. 

Now  we  update  the  depth  estimate  and  its  variance  according  to  the  update  equations 
(46),  (47).  The  resulting  depth  map  is  shown  in  figure  12  b).  As  a  result  of  the  update, 
the  flat  depth  map  is  almost  completely  replaced  by  the  incoming  measurement  except  in 
areas  of  large  error  as  indicated  by  the  variance  values,  locations.  This  is  precisely  the 
effect  we  expect  from  the  update  procedure  and  we  can  imagine  how  a  repeated  application 
of  this  process  over  time  can  significantly  reduce  the  amount  of  error  present  in  a  single 
measurement. 
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Figure  12:  Depth  maps  from  the  pepsi  scene  a)  measured  using  the  patch  estimator  with 
patch  size  9,  b)  after  the  update  process  with  an  initially  flat  depth  map. 

5.3  Algorithmic  aspects  of  the  update  process 

The  serial  complexity  of  the  update  process  is  easily  seen  to  be  0(nJ)  with  n  being 
the  width/height  of  the  image.  Given  n  x  n  processors  on  a  parallel  machine  this  can  be 
reduced  to  0(1)  since  the  update  equations  are  local  to  each  pixel.  The  filter  update  is 
indeed  the  fastest  of  all  operations  that  constitute  the  dynamic  motion  vision  algorithm. 


6  The  Prediction  Stage 


The  task  of  the  prediction  stage  as  we  see  from  the  block  diagram  of  figure  4  is  to 
determine  how  a  depth  and  variance  map  at  time  k  will  appear  to  the  observer  at  time 
k  4-  1  assuming  that  the  relative  motion  of  observer  and  surface  are  given  by  the  motion 
vectors  t  and  u>.  Informally  this  may  be  stated  as  follows:  given  an  estimated  depth  map 
and  a  3D  motion,  what  will  the  depth  map  look  like  after  the  motion? 

We  state  the  problem  more  formally.  Let  (X,  Y,  Z)  denote  coordinates  of  a  point  in  the 
real  world  and  (z,  y)  denote  the  coordinates  of  its  projection  in  the  image  plane.  Given  an 
array  of  depth  values  Z(ij,  y,)  for  i  =  0, . . . ,  h  —  1  and  j  =  0, . . . ,  tv  -  1  (i.e.  on  a  regular 
grid)  and  a  motion  t  =  [U,V,W]T,  u  =  [A, 5, C]T,  what  are  the  values  of  Z(xj,y,)  after 
the  given  motion? 

Since  the  Z  values  are  known  only  at  the  grid  points,  it  does  not  suffice  to  simply 
transform  the  3D  coordinates  corresponding  to  the  given  motion.  The  transformed  values 
may  not  coincide  with  grid  points,  they  may  have  become  occluded  or  moved  out  of  the 
field  of  view.  A  resampling  of  the  warped  surface  and  hence  some  form  of  interpolation 
between  the  given  samples  is  necessary. 

An  additional  problem  is  the  following:  We  interpret  the  given  values  of  Z(xj,  y,)  as  nor¬ 
mally  distributed  random  variables  with  variance  p(Xj,yi).  What  are  the  value  of  p(zj,  y.) 
after  the  warping?  The  answer  to  this  question  is  a  integral  part  of  the  dynamic  motion 
vision  algorithm.  We  present  here  the  algorithm  for  warping  the  depth  map  Z(xj,yi).  The 
transformation  of  variances  presents  no  fundamentally  different  challenges  but  is  rather 
lengthy  in  nature  so  we  chose  to  attach  it  as  an  appendix  (Appendix  A). 

6.1  Warping  the  depth  map:  outline  of  the  algorithm 

The  outline  of  the  algorithm  is  as  follows: 

1.  Using  the  equations  of  perspective  projection  we  compute  the  3D  coordinates  corre¬ 
sponding  to  the  given  depth  values.  Thus,  we  have  the  transformation 

{Xiji  Vi}!  Zij)  ( Xij ,  Yij ,  Zij)  (48) 

2.  Using  the  equations  of  motion  we  tranform  the  3D  points  into  the  new  coordinate 
system.  The  transformation  is 

(X^Yij^^-^iXl^Z'j)  (49) 

3.  For  each  pixel  (x,y)  in  the  new  coordinate  system  we  seek  to  determine  the  location 
in  3D  where  the  ray  through  (z,y)  intersects  a  surface  through  the  points  computed 
in  the  previous  step.  The  Z  coordinate  of  the  intersection  becomes  the  Z  value  stored 
at  (x,y).  This  procedure  is  referred  to  as  resampling. 

4.  Grid  values  of  Z  that  remained  undetermined  after  the  previous  resampling  step 
(parts  of  the  surface  which  enter  the  field  of  view  due  to  the  motion)  must  be  filled. 
We  extrapolate  from  the  known  values  in  a  straightforward  way. 

We  now  elaborate  the  details  of  the  above  algorithm. 
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Figure  13:  A  depth  surface  Z(x,y)  before  prediction  warping. 

0.1.1  The  initial  situation 

We  start  with  Z(xjt  y<)  =  Zij  for  *  =  Q, ...  ,h  —  1  and  j  =  0,...,u>  —  1  as  the  depth 
map.  Xj  and  yj  are  given  by 


*,•  =  (j  -  “y  )Az  and  yi  =  (*  -  ^y-^Ay 


(50) 


where  w,  h  are  the  dimensions  of  the  camera  pixel  array  and  Az,  Ay  are  the  spadngs 
of  pixels  in  horisontal  and  vertical  directions.  This  situation  is  shown  in  figure  13.  The 
motion  is  given  as  t  =  [U,  V,  W]T  and  u>  =  [A,  B,  C]r. 


6.1.2  Inverse  Projection 

We  compute  the  3D  coordinates  of  all  the  points  whose  depth  is  stored  in  the  depth 
map  using  inverse  perspective  projection.  The  goemetry  of  perspective  projection  along 
the  x  direction  is  shown  in  figure  14.  We  compute 

Xu  =  9iZi<9V*\  and  Yu  =  (51) 

for  t  =  0, . . . ,  h  -  1  and  j  =  0, . .  • ,  w  -  1.  The  focal  length  is  denoted  by  /. 
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Image  Plana 


Figure  14:  The  perspective  projection  situation  along  the  *  direction. 

0.1.S  Warping 

We  transform  all  the  3D  points  into  the  coordinate  system  obtained  after  preforming 
the  given  motion  t,  u.  Transformed  points  will  be  denoted  by  a  prime.  We  have 


X'.. 

Vi  = 

z'a 


u 

-  V 

w 


1  -C  B 
C  1  -A 
-B  A  1 


for  *  =  0, . . . ,  h  —  1  and  j  =  0, 1. 


Surface  Z(x,y) 


mm 


Surface  Tlx, y) 


frame  in  space  according  to  the  given  motion.  The  result  can  be  visualised  as  in  figure  15. 
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6.1.4  Resampling 

Now  we  encounter  the  main  problem  in  the  prediction  algorithm.  If  we  project  the 
points  Z'ij)  back  into  the  image  plane  they  will  most  likely  not  coincide  with  the 

grid  points  as  shown  in  figure  15.  In  other  cases  the  backprojected  points  may  have  become 
occluded  by  other  points  or  have  moved  out  of  the  image  plane.  We  handle  these  problems 
as  follows:  we  interpret  each  point  {X' ,Y\  Z')  obtained  from  the  warping  step  as  a  sample 
of  a  3D  surface  and  the  objective  is  to  resample  this  surface  at  the  grid  points.  This  requires 
tin  assumption  about  the  structure  of  the  surface  inbetween  the  sample  points. 


Figure  16:  Triangular  facet  subdivision. 

We  will  consider  each  grid  square  in  the  array.  Such  a  square  is  given  by  its  corner 
coordinates  (ij),(i  +  l,j),(*J  +  1),  (*  +  1  ,j  +  1).  For  each  of  the  four  corners  there  is 
a  warped  3D  surface  point  (X{j,Yi'J,Z'})  available.  We  divide  the  grid  square  into  two 
triangles 

(i,j),(i+  l,j),(i,j  +  1)  (53) 

(i  +  1  ,j  +  l),(i  +  +  1)  (54) 

as  shown  in  figure  16.  The  three  points  in  3D  corresponding  to  these  corners  define  a  plane 
in  space  which  we  will  use  as  our  surface  approximation.  Most  importantly  the  spatial 
triangle  T  between  these  points  prespectively  projects  into  a  triangle  V  in  the  image  plane. 
All  grid  points  inside  V  have  rays  which  intersect  the  spatial  triangle  T. 
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The  value  of  Z  which  we  obtain  by  this  intersection  procedure,  however,  need  not  be  the 
truly  visible  depth  value,  since  the  currently  considered  spatial  triangle  may  be  occluded 
by  other  triangles.  But  if  we  repeat  the  procedure  of  computing  the  intersection  for  all 
possible  spatial  triangles,  then  the  smallest  value  of  Z  obtained  at  a  given  grid  point  will 
be  the  truly  visible  depth  value. 

1.  Determining  the  image  plane  projections  of  the  spatial  triangles 


For  every  spatial  triangle  T  we  have  the  3D  coordinates  of  the  corners  [Xx,  Fj,  Z\]T, 
[X2,Yi,ZiY  and  [X3, Y3, Zz]T .  It  is  intuitively  clear  and  can  be  shown  analytically, 
that  this  spatial  triangle  T  perspectively  projects  into  a  triangle  V  in  the  image  plane. 
The  corners  of  this  triangle  are  (*i,yi),  (12,1/2)  a-nd  (*3,1/3)  where 


fXk 

x‘  =  /zT 


and 


fYk 

yk  =  /T> 


(55) 


for  k  =  1,2,3. 


2.  Determining  the  grid  points  inside  the  triangle  T' 


The  rays  through  all  grid  points  ( x,y )  inside  T'  intersect  the  corresponding  spatial 
triangle  T  i.e.  the  warped  surface  and  it  is  these  intersection  points  that  we  would 
like  to  determine.  The  first  step  is  to  compute  all  grid  points  inside  V . 

We  first  determine  the  bounding  box  of  the  given  triangle.  The  *  coordinates  are 
between  xmin  and  xmax  given  by 

Xmin  =  min(zx,x2,*3)  and  xmax  =  max(xi,x2,x3)  (56) 

and  similarly  we  compute  bounds  ymi„,  ymax  on  the  y  coordinates.  This  determines  a 
square  in  the  grid.  The  grid  points  inside  this  square  have  coordinates  in  the  ranges 


*mox]  «Uld  [jmin,  jmox]  where 

«(^:+ Y> 

and 

fl  !  Xjna *  .  W  ~  1  \ 

Jmax  —  floor(  ^  ^  ) 

(57) 

and 

c  1  1  ymax  .  h  ~  1  , 

“  Ay  2 

(58) 

in  which  ceil(x)  is  the  smallest  integer  larger  than  x  and  floor(x)  is  the  largest  integer 
smaller  than  x.  In  addition,  we  must  insure  that  the  values  jmin,  jmax,  imj„  and  imax 
are  actually  grid  points  i.e.  that  they  are  in  the  range  [0,  w  - 1]  or  [0,  h.  - 1]  respectively. 
If  this  is  not  the  case,  we  set  them  to  the  closest  value  which  is  inside  the  grid. 

For  all  grid  points  with  indices  (i,j)  in  the  computed  range  we  must  now  determine,  if 
they  actually  lie  inside  the  triangle  T'.  We  can  use  a  simplified  polygon  containment 
test:  Arrange  the  vertices  of  the  triangle  in  counterclockwise  order  and  then  test  if 
the  candidate  point  lies  in  the  left  half  plane  of  all  of  the  three  edges  as  we  proceed 
in  counterclockwise  direction. 
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More  formally,  to  test 

x  =  (j  -  ^)Ax  (59) 

y  =  (*-^y^)Ay  (60) 

for  containment  in  (xj,yi)  -  (13,1/2)  -  (13,1/3)  we  first  compute 

(*3  “  -  yi)  -  (ift  -  yi)(*2  -  Xi).  (61) 

If  this  is  positive,  then  the  given  sequence  of  points  is  in  clockwise  order  and  we  switch 
(x2,i/2)  and  (X3, 1/3)  to  obtain  counterclockwise  order. 

Now  we  compute  the  three  values 


(*2  -  *i)(y  -  yi)  -  (*  -  *i)(y2  -  yi) 

(62) 

(x3  -  *2)(y  -  1/2)  -  (x  -  X2)(y3  -  y2) 

(63) 

(*i  -  X3)(y  -  !&)  -  (x  -  x3)(yi  -  y3) 

(64) 

If  all  three  values  are  positive,  the  point  (x,y)  is  inside  the  triangle.  Each  test  deter¬ 
mines  containment  in  one  left  half-plane.  For  each  spatial  triangle  T  we  thereby  obtain 
a  (possibly  empty)  set  of  image  plane  grid  points  (x,  y)  which  have  rays  intersecting 
T. 

3.  Determining  the  intersections  of  the  grid  rays  with  the  spatial  triangles 


For  each  of  the  grid  points  (x,y)  the  ray  has  the  equation 


'  X 

X 

Y 

=  A 

y 

Z 

for  positive  A.  The  plane  through  the  spatial  triangle  T  has  the  equation 


’  x  ' 

'  ' 

‘  *2  ' 

'  ' 

Y 

Z 

=  a 

Yi 

Zx 

+  0 

y2 

z2 

+  (1  -Q-/J) 

z3 

(65) 


(66) 


In  computing  the  intersection  of  the  ray  with  the  plane  we  obtain  three  linear  equa¬ 
tions  for  the  three  unknowns  a,  0  and  A.  Knowledge  of  A  is  sufficient  to  determine 
the  point  of  intersection. 

The  determinant  of  the  system  is 


D  =  (Xx  -  X3)[(Z2  -  Z3)y  -  (Y2  -  Y3)f]  -  (67) 

(Yi  -  Y3)[(Zt  -  Z3)x  -  (X2  -  X3)f]  + 

( Zl~Z3)[(Y2-Y3)x-(X2-X3)y } 
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and  the  adjunct  of  A  is 


Dx  =  -X3[(Yl-Y3)(Z2-Z3)-(Y2-Y3)(Zl-Z3)\ 
+Y3[(X  1  -  X3)(Z2  -  Z3)  -  (X2  -  X3)(Z ,  -  Z3)\ 
-Z3[{Xx  -  X3)(Y2  -  Y3)  -  (X2  -  XaXn  -  Y3)\. 

So  if  D  is  nonzero  we  have 


and  the  intersection  point  is  given  by 


'  X  ' 

X 

y 

=  A 

y 

z 

(68) 


(69) 


(70) 


The  case  where  D  =  0  is  mathematically  possible  in  two  cases,  only  one  of  which  can 
occur  here.  In  the  first  case  the  ray  and  the  plane  do  not  intersect.  This  cannot  occur 
since  we  have  selected  the  rays  in  such  a  way  that  they  must  intersect  the  triangle. 

The  second  case  is  where  the  ray  lies  in  the  plane  defined  by  the  spatial  triangle.  In 
this  case  there  are  are  infinitely  many  solutions  which  will  manifest  itself  in  D  =  0. 
In  this  case  we  would  like  to  have  the  intersection  point  with  the  smallest  value  of  Z. 
This  point  will  lie  on  one  of  the  three  triangle  edges.  We  can  therefore  compute  the 
intersection  of  the  ray  with  the  three  triangle  edge  segments  (there  must  be  at  least 
two  intersections)  and  select  the  one  with  the  smallest  depth  value. 

Each  edge  segment  of  the  spatial  triangle  is  given  by  two  points  in  space  [Xi,Yi,  Zi]T 
and  [X2,Y2,  Z2]t.  The  equation  of  the  line  is 


’  x  ' 

‘  x2' 

y 

=  a 

+  (!-<*) 

y2 

z 

Zi 

Z2 

which  we  must  intersect  with  the  ray 


'  X  ' 

X 

y 

=  A 

y 

z 

(71) 


(72) 


for  A  >  0  and  0  <  a  <  1.  This  yields  three  equations  for  the  two  unknowns  A  and  a. 
Since  we  know  (by  the  fact  that  D  =  0)  that  the  two  must  intersect,  we  can  use  any 
two  of  the  equations  and  compute  the  determinant 


d  =  x{Yx  -  Y2)  -  y(X j  -  X2)  (73) 

as  well  as  the  adjuncts 

da  =  —  yX2  +  zY2  (74) 

dx  =  (^-.v2)y2-(yi-y2)x2  (75) 


If  d  is  nonzero  we  have 


(76) 


a  = 


da 

d 


and 


A  = 


dx 

d 


Ifa<0ora>l  the  ray  intersects  the  line  outside  the  segment  between  the  two 
given  points  and  we  try  the  next  triangle  edge.  A  <  0  is  not  possible.  Having  tested 
all  these  conditions,  the  desired  intersection  is 


'  *  ■ 

X 

Y 

=  A 

y 

Z 

(77) 


Finally,  if  d  is  zero,  then  the  triangle  edge  segment  coincides  with  the  ray  and  the 
desired  point  is  the  segment  endpoint  with  the  smallest  depth  value.  In  other  words 
if  Z\  <  Z2  the  intersection  point  is  [Xi, Fj ,  Zi]T  otherwise  [A2,  Yj,  Z2]r. 


4.  Creating  the  resampled  depth  map 


Each  spatial  triangle  T  may  lead  to  depth  values  Z  for  a  number  of  different  grid 
point  locations  (z,y).  Some  grid  points  may  have  multiple  values  of  Z  assigned,  if 
the  ray  through  that  point  intersects  the  3D  surface  more  than  once.  Among  these 
points  we  are  interested  in  the  one  closest  to  the  camera,  for  it  will  occlude  all  others. 
Hence  we  simply  ignore  all  depth  values  Z  at  a  particular  point  except  the  smallest 
one. 

In  practice  we  will  initialize  the  depth  map  Zj;  with  00  (some  very  large  value)  before 
resampling  the  warped  depth  surface  as  described  above.  Whenever  a  new  value  Z  is 
computed  for  location  (i,j)  we  replace  with  Z  if  and  only  if  Z,;  >  Z.  After  each 
grid  triangle  has  been  considered,  Z,;  will  contain  only  the  minimal  values  of  depth 
computed  at  that  point. 


6.1.5  Extrapolating  to  unassigned  grid  points 


It  is  easily  conceivable  that  some  grid  points  will  not  have  been  assigned  any  value  of  Z 
after  the  resampling  step.  This  will  be  reflected  by  the  fact  that  some  depth  map  values  are 
still  00  after  all  grid  triangles  have  been  considered.  Physically  these  points  correspond  to 
surface  points  which  have  become  visible  due  to  the  relative  motion  of  camera  and  scene, 
so  that  no  depth  information  is  available  there. 

In  general,  we  can  say  nothing  about  the  correct  depth  value  at  these  locations.  If. 
however,  we  assume  that  the  surface  is  somewhat  smooth  then  we  can  extrapolate  the 
depth  values  at  these  points  from  known  values  at  neighboring  locations. 

In  our  current  algorithm,  we  search  the  depth  map  for  unassigned  locations  (where 
Z  =  00).  Suppose  (*,j)  is  such  a  grid  location.  We  then  search  all  neighboring  pixels  (with 
coordinates  differing  by  -1,  0  or  1  from  (*,  j))  for  finite  values  of  Z.  Say  we  find  n  such 
pixels.  Let  the  sum  of  the  Z  values  of  these  pixels  be  £Z.  Then  we  assign  the  value 


n 


(78) 
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at  location  (t,j).  If  all  neighboring  pixels  were  also  unassigned,  we  increase  the  radius  of 
our  search  from  1  to  2  and  try  again.  If  the  search  radius  exceeds  the  size  of  the  image 
then  none  of  the  grid  points  has  been  assigned  any  value.  This  would  mean  that  none  of 
the  previously  visible  surface  part  is  visible  in  the  current  image. 

6.1.6  Summary  of  the  algorithm 

Below  we  briefly  summarize  the  steps  of  the  depth  prediction  algorithm  using  the  no¬ 
tation  from  above 


1.  Initialize  the  reinterpolation  depth  map  to  Zij  =  oo  for  all  i,j. 

2.  Determine  the  3D  coordinates  of  each  point  stored  in  the  current  depth  map  using 
the  inverse  perspective  projection: 


v  _ 

—  j 


and 


v.  -  yj&L 

13 "  ~r 


(79) 


3.  Warp  the  3D  points  into  the  new  coordinate  frame: 
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(80) 


4.  Project  each  3D  point  back  into  the  image  plane 

and 


*  ..f*L 

»J  J  7< 
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Y' 

v'  =  f— 

Vi;  J 


(81) 


5.  Each  grid  square  (i,j)  —  («’+  l,j)  —  (*,  j  + 1)  —  (*+  1,  j  + 1)  is  divided  into  two  triangles 
(*,  j)-(i  +  l,i)-(t,J  + 1)  and  (i+l,j  +  l)-(i  + 1,  j)  —  (*,  j+  1).  For  each  such  triangle 
the  previous  step  has  produced  three  corner  points  (x\  x/)  in  the  warped  image  plane. 
We  determine  all  the  grid  points  in  the  warped  grid  which  fall  into  this  triangle.  This 
is  done  by  first  computing  the  bounding  box  of  the  triangle  and  then  testing  each 
grid  point  in  the  bounding  box  for  containment  in  the  triangle.  The  ray  through  each 
grid  point  found  to  be  in  the  triangle  will  intersect  the  corresponding  spatial  triangle. 

6.  For  each  grid  point  (i,j)  inside  of  a  triangle  we  compute  the  intersection  of  the  ray 
through  that  gridpoint  with  the  corresponding  spatial  triangle  according  to  one  of 
the  three  cases  discussed  in  subsection  3.  If  the  Z  coordinate  of  the  intersection  is 
smaller  than  the  current  value  of  Zij  we  replace  Zij  by  the  new  Z. 

7.  After  ail  grid  triangles  have  been  considered,  we  search  the  depth  map  ZtJ  for  oo.  For 
each  such  grid  point  found  we  extrapolate  Z  from  the  closest  gridpoints  with  Z  ^  oo 
as  described  above. 


Using  the  procedure  described  above  and  some  simple  facts  about  the  propagation  of 
variances  we  can  also  determine  the  variances  of  the  warped  and  resampled  depth  map. 
These  variances  are  needed  for  the  operation  of  the  recursive  estimation  procedure  in  the 
next  iteration.  Due  to  the  lengthy  nature  of  the  derivation  we  have  attached  it  as  Appendix 
A  for  the  interested  reader. 

0.2  Algorithmic  aspects  of  the  prediction  stage 

Suppose  that  the  depth  map  consists  ofnxn  values.  This  means  there  are  2(n  -  l)2 
triangular  facets  which  must  be  considered  for  backprojection  after  the  warping  stage.  In 
the  worst  case,  the  backprojection  of  each  facet  could  subsume  the  entire  image  plane  and 
therefore  necessitate  the  update  of  n2  depth  and  variance  values.  The  worst  case  complexity 
is  therefore  0(n4).  For  most  well-behaved  surfaces  and  small  motions,  however,  the  the 
backprojection  of  each  triangle  will  subsume  a  number  of  pixels  which  can  be  bounded  by 
a  small  constant. 

A  parallel  machine  could  devote  one  processor  to  the  computation  associated  with  one 
triangular  facet.  As  we  pointed  out  the  result  of  this  computation  may  effect  all  n2  pixels 
in  the  image  plane  so  0(n2)  operations  are  necessary  at  each  processor.  The  prediction 
part  of  the  algorithm  is  computationally  most  expensive. 
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7  The  Motion  Estimation  Stage 


In  our  previous  discussion  we  have  seen  that  both  the  depth  measurement  and  the 
Kalman  filter  prediction  stage  require  knowledge  of  the  parameters  of  rigid  body  motion  t 
and  u>.  In  some  cases,  the  motion  may  actually  be  known  from  some  external  source,  for 
example,  when  our  camera  is  mounted  on  a  vehicle  for  which  the  motion  can  be  measured 
or  commanded  precisely.  If  so,  this  information  can  be  used  and  the  subsequently  described 
motion  estimation  module  is  not  needed.  If,  however,  motion  information  is  not  available, 
the  motion  must  be  determined  from  the  image  sequence. 

7.1  Least  squares  motion  estimation 

As  we  pointed  out  before,  motion  estimation  using  the  brightness  change  constraint 
equation  (8)  is  not  possible  since  the  depth  Z  is  unknown.  At  this  point,  we  can  exploit 
the  iterative  nature  of  the  recursive  estimation  process.  At  every  point  in  time,  the  filter 
produces  an  estimate  of  the  depth  Zk .  We  use  this  estimate  of  the  depth  map  as  input 
to  the  motion  estimation  stage  as  shown  in  the  block  diagram  4.  The  task  of  the  motion 
estimation  stage  is  therefore:  given  a  depth  map  Z(x,y)  and  the  brightness  gradients  Ex, 
Ey,  Et  compute  the  motion  t  and  u>. 

If  the  depth  Z  is  given,  the  brightness  change  constraint  equation  (8)  contains  6  un¬ 
knowns:  the  motion  vectors  t  and  u>.  We  obtain  one  such  linear  constraint  equation  for 
every  pixel  in  our  image.  Instead  of  selecting  6  of  these  equations  and  solving  them  for 
the  desired  parameters  we  formulate  a  least-squares  problem  using  the  constraints  at  every 
pixel  within  an  image  patch  in  order  to  reduce  the  effect  of  noise  (see  Horn,  Weldon  [28]). 

More  formally,  we  have 

min££(^  +  v.u >  +  £f)2  (82) 

which  we  differentiate  with  respect  to  the  motion  vectors  to  obtain  the  necessary  condition 
for  a  minimum 

(EE^'+TEt1"  =  -EEt  m 

£Ey)t  +  (EEvvT)w  =  (84) 

x  y  x  y  x  y 

The  above  summations  are  carried  out  over  all  pixels  in  the  image  region  of  interest.  This 
linear  system  (83),  (84)  of  6  equations  can  be  easily  computed  from  the  brightness  gradients 
and  the  given  depth  Z.  It  is  noteworthy  in  this  context  that  the  system  matrix  is  symmetric 
so  that  the  number  of  coefficients  to  compute  can  be  halved.  Any  standard  technique  (see 
[52])  can  be  used  to  solve  the  linear  system. 

7.2  Evaluating  the  motion  estimation  procedure 

A  crucial  point  is  of  course  the  fact  that  the  depth  map  used  in  the  motion  estimation 
is  not  the  true  depth  map  but  the  depth  map  produced  by  the  filter,  which  is  subject  to 
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errors.  An  important  issue  is  therefore  how  the  motion  estimation  procedure  performs  in 
the  presence  of  errors  in  the  depth  map.  This  is  difficult  to  evaluate,  since  true  depth  maps 
for  scenes  are  rarely  available.  This  is  where  the  experiments  on  planar  scenes  come  in 
handy.  Planar  depth  maps  are  trivial  to  create  and  the  true  depth  of  the  experimental 
scene  can  be  measured  quite  accurately.  Moreover  the  error  in  a  typical  depth  estimate  as 
it  is  produced  by  the  patch  depth  estimation  procedure  is  approximately  Gaussian  (figure 
8)  and  can  therefore  be  recreated  easily  in  an  experiment. 

We  used  the  scenes  “wave”  and  “onnanoko”  shown  in  figures  6  a)  and  b)  of  section 
3.2.  In  each  case  we  have  frontal  planes  at  700  mm  and  500  mm  respectively.  We  created 
depth  maps  containing  these  values  at  every  point  (x,y)  with  additive  Gaussian  noise  of 
varying  standard  deviation.  We  then  used  these  depth  maps  and  the  brightness  gradients 
computed  from  the  images  to  the  motion  estimation  procedure  described  above. 


6wr  h  TimWIff  (*Mk 
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Figure  L7:  Relative  error  in  the  motion  estimate  vs.  standard  deviation  of  the  depth  error 
(normalized  by  true  depth)  a)  wave  scene,  b)  onnanoko  scene.  Both  axes  are  labeled  in 
percent. 


We  evaluate  the  error  in  the  resulting  motion  estimate  using  the  measure  (both  motions 
were  pure  translations): 


£  = 


|t-t| 

|t| 


(85) 


where  t  is  the  estimate  obtained  for  the  translation  vector  and  t  is  the  true  value.  This  is 
the  normalized  error  in  the  vector.  The  measure  s  was  computed  using  depth  maps  with 
noise  of  standard  deviation  0, 1,  5,  10,  50,  100,  200  and  300  mm.  The  resulting  values  of  the 
error  in  the  motion  as  a  function  of  the  noise  standard  deviation  (normalized  by  the  true 
depth)  are  shown  in  figures  17  a)  and  b)  for  the  wave  and  onnanoko  scenes  respectively. 

Both  results  show  that  25%  noise  results  in  about  20%  error  in  t/e  motion  estimate. 
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For  the  onnanoko  scene  with  a  true  depth  of  500  mm,  for  instance,  this  means,  that 
when  68  %  of  all  depth  values  are  in  the  range  [375,625],  a  motion  estimate  of  t  = 
[1.419762,0.046511, -2.713889]7  (vs.  the  actual  t  =  [1.5,0, -3. 0]T)  is  obtained.  This  indi¬ 
cates  that  the  motion  estimation  procedure  is  quite  robust  with  respect  to  errors  in  depth. 

This  robustness  of  the  motion  estimator  is  extremely  important  for  the  convergence  of 
the  filtering  routine.  In  the  filtering  procedure,  the  largest  errors  in  depth  occur  initially, 
when  no  information  about  depth  is  available.  Initially,  we  start  with  a  fiat  depth  map  and 
may  potentially  have  large  errors  in  depth.  Robustness  of  the  motion  estimator  is  essential 
to  prevent  divergence  at  that  point. 

The  motion  estimator  can  be  viewed  from  another  perspective.  The  motion  vectors  are 
parameters  in  the  Kalman  filter  formulation.  Adding  the  motion  estimator  is  equivalent 
to  fitting  these  parameters  to  the  data  in  a  least  squares  fashion  in  every  iteration.  The 
proposed  approach  is  therefore  a  least-squares  adaptive  filter.  A  distinct  advantage  of  this 
approach  is  the  fact  that  no  particular  care  must  be  taken  in  the  choice  of  the  initial  depth 
value,  since  the  motion  estimator  automatically  adjusts  the  motion  parameters  to  fit  the 
depth  data  by  exploiting  the  motion-depth  ambiguity.  The  initial  depth  value  does  not 
determine  whether  or  not  the  algorithm  converges,  it  merely  sets  the  absolute  values  of 
motion  and  depth  values  it  will  eventually  converge  to.  If  fixed  (known)  motion  is  used, 
this  is  not  true. 

The  above  experiments  were  carried  out  using  purely  translational  motion  only.  Of 
course  the  formalism  holds  for  arbitrary  rotational  and  translational  motion.  However,  we 
discovered,  that  the  motion  estimation  does  not  exhibit  the  necessary  robustness  discussed 
above,  when  all  6  degrees  of  motion  are  admitted.  This  does  not  mean  that  rotational 
motion  cannot  be  handled.  Our  experiments  showed  similarly  stable  results  as  shown 
above  in  the  case  of  one  rotational  and  2  translational  degrees  of  freedom  (as  in  a  mobile 
robot  for  example).  For  our  presentation,  we  chose  to  restrict  our  experiments  to  purely 
translational  motion  only. 

7.3  Algorithmic  aspects  of  the  motion  estimation  process 

The  complexity  analysis  is  quite  simple.  Each  sum  in  the  formulas  (83),  (84)  contains 
0(n2)  elements  so  that  0(n2)  operations  will  be  necessary.  Speedups  can  be  achieved  by 
noting  that  the  system  matrix  in  these  equations  is  symmetric.  On  a  parallel  machine  the 
summations  can  be  carried  out  in  O(log(n))  time  using  a  tree  connection  structure  of  the 
individual  processors. 
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8  Smoothing  and  Filling  In 

8.1  High  variance  regions:  the  need  for  surface  reconstruction 

The  adaptive  recursive  estimation  scheme  described  so  far  has  the  following  limitation: 
Given  a  sequence  of  measurements,  it  will  choose  an  estimate  that  best  matches  the  mea¬ 
surements  according  to  their  quality  as  indicated  by  the  variances.  Conceptually,  we  start 
with  some  initial  value  for  the  depth  estimate  and  each  measurement  pulls  the  estimate 
towards  the  true  value.  The  smaller  the  variance,  the  stronger  the  pull. 

If,  however,  the  measurements  at  a  given  point  (x,y)  in  the  image  plane  always  exhibit 
a  high  variance,  then  the  estimate  will  never  move  from  its  initial  value.  In  essence,  no 
information  about  the  actual  value  is  supplied  to  the  filter  so  the  estimate  is  never  changed. 
A  practical  situation  that  demonstrates  this  is  the  pepsi  scene.  Shown  in  figure  18  are  an 
image  from  the  pepsi  sequence,  the  depth  map  and  the  logarithm  of  the  variance  map. 


Figure  18:  a)  Pepsi  scene  b)  Depth  map  computed  with  patch  estimate  (size  9)  and  c) 
logarithm  of  variances. 

As  we  have  already  noted  in  section  3.2,  the  lower  right  portion  of  the  pepsi  can  has  large 
regions  of  uniform  brightness.  Large  errors  in  the  depth  values  result  and  the  regions  have 
high  variance  values  (high  brightness  in  the  variance  map).  Moreover,  the  measurements 
obtained  in  these  regions  exhibit  high  variances  throughout  the  entire  sequence  i.e.  there  is 
never  any  useful  information  available  for  depth  estimation  at  these  points. 

From  the  recursive  estimation  point  of  view,  this  cannot  be  helped.  It  is,  however, 
rather  unsatisfactory  as  the  result  of  an  image  analysis  algorithm.  To  improve  this  situation 
we  assume  that  the  surface  being  observed  exhibits  some  amount  of  smoothness  so  that 
"bad”  values  may  be  filled  in  by  neighboring  "good”  values.  In  essence,  we  have  a  surface 
reconstruction  problem.  Note  that  the  surface  reconstruction  is  completely  orthogonal  to 
the  dynamic  motion  vision  algorithm  from  the  estimation  theoretical  point  of  view.  Its 
effects  are  an  improvement  of  the  obtained  depth  maps  that  make  them  more  intuitively 
plausible. 
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8.2  MRF  surface  reconstruction  and  why  it  fails 

Reconstructing  smooth  surfaces  is  a  problem  that  has  been  dealt  with  extensively  and 
we  do  not  attempt  to  make  a  contribution  to  this  field.  Naive  solutions  such  as  smoothing 
or  lowpass  filtering  the  depth  map  fail  when  discontinuities  are  present  (blurring  of  edges). 
Mathematical  foundations  for  smooth  surface  reconstruction  in  the  presence  of  disconti¬ 
nuities  were  laid  by  Geman  and  Geman  [18].  Work  by  Grimson  [19],  Terzopoulos  [48], 
Marroquin  [33]  and  Blake  and  Zisserman  [6]  explored  applications  to  vision.  The  stochastic 
relaxation  algorithms  used  there  are  computationally  quite  intensive.  Geiger  and  Girosi 
[16]  recently  developed  a  deterministic  approximation  to  Markov  random  fields  based  on 
mean  field  theory  which  is  computationally  quite  simple. 

We  briefly  summarize  the  results  from  [16]  as  they  apply  to  our  surface  reconstruction 
application.  Given  a  field  Zij  on  a  regular  lattice  of  points  ( i,j )  we  seek  to  determine  a 
field  Zij  which  is 

1.  "close”  to  the  input  data  by  some  measure, 

2.  locally  "smooth”  by  some  measure  except  at  locations  where  Zij  exhibits  discontinu¬ 
ities. 

Formally,  these  measures  of  "closeness”  and  "smoothness”  are  captured  by  an  "energy 
function”  which  the  desired  surface  function  Zij  should  minimize 

E  =  ^clote  4*  E smooth  +  Eline  (86) 

where 

E close  =  )  '( Zil  —  Zij)2  (87) 

Esmooth  —  Q  yi[(Z,-H.)  —  Zjj)2(  1  —  hg)  +  (Zj.i+1  —  Zjj)2(  1  —  V|i)]  (88) 

•J 

Eline  =  7  ^(fiij  +  Vij)  (89) 

The  first  term  enforces  "closeness”  as  before,  the  second  "smoothness”  except  at  locations 
where  the  “line  processes”  h  or  v  are  1  (at  discontinuities)  and  the  third  term  is  needed  to 
prevent  line  processes  from  being  created  everywhere.  Line  process  fields  have  value  ranges 
[0, 1]  at  every  site  (t,  j)  where  1  indicates  the  presence  and  0  the  absence  of  a  discontinuity 
at  that  location. 

Using  mean  field  theory,  Geiger  and  Girosi  derived  the  following  iterative  scheme  to 
determine  the  field  Zij  and  the  line  processes  v,j,  hij  that  minimize  the  above  energy 
functional: 


=  Z?j-u[Z?j-F(Z,Zn,hn,vn,i,j )] 

(90) 

=  H{P,hn,i,j) 

(91) 

»;+1 

(92) 
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where 


F(Z,Z,h,v,i,j) 


V{f,v,i,j) 


Zij  - 

«[(4  —  Zij- j)(l  —  Wij-i)  —  (Zi,j+\  —  Zitj)(  1  —  Vjj)  + 
(Zij  —  —  h,_i  j)  —  (Zi+ij  —  Zij)(  1  —  hij )] 

_ 1 _ 

1  e0h-a(2i+ij-2i })2) 

_ 1 _ 

1  +  e0(-r-a(2i,,+i -£,,)*) 


(93) 

(94) 

(95) 


and  n  is  the  iteration  index. 

The  parameters  a,  0,  7  and  u  play  different  roles  in  this  formulation.  The  values  of 
0  and  values  of  u>  were  set  to  0  =  10  and  u  =  0.01  as  they  were  found  to  influence  the 
convergence  behavior,  but  not  the  final  result.  The  relationship  between  a  and  7  determines 
the  tradeoff  between  smoothing  and  line  processes.  Geiger  and  Girosi  show  that  for  absolute 
gradient  values  larger  than  ^7/0  line  processes  will  be  created  to  inhibit  smoothing.  7  and 
a  should  therefore  be  chosen  so  that  \Zf/a  corresponds  to  a  threshold  value  above  which 
one  wishes  to  introduce  a  discontinuity  and  prevent  smoothing. 


Figure  19:  The  depth  map  for  the  pepsi  scene  after  one  iteration  of  the  filter.  The  patch 
depth  estimator  of  size  9  and  the  motion  estimator  were  used.  The  initial  depth  map  is  flat 
1000  mm. 

We  have  implemented  the  above  iterative  solution  to  the  mean  field  approximation  to 
an  MRF  for  smoothing.  Figure  19  shows  the  depth  map  result  of  the  update  stage.  Figures 
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Figure  20:  The  depth  maps  from  the  pepsi  scene  with  a)  7  =  160000,  b)  7  =  250000  and 
c)  7  =  360000  (see  text). 

20  a),  b)  and  c)  show  the  result  of  applying  the  MRF  algorithm  for  3  iterations  with 
parameters 

a)  a  ss  4,  7  =  160000  (edge  threshold  =  200) 

b)  a  =  4,  7  =  250000  (edge  threshold  =  250) 

c)  a  ss  4,  7  ss  360000  (edge  threshold  =  300) 

The  input  field  range  is  [459,2967],  Figures  21  a),  b)  and  c)  show  the  line  processes  as 

=  hij  +  Vij  (96) 

where  black  dots  indicate  the  location  of  a  boundary  and  white  points  are  locations  where 
smoothing  was  enabled. 

Only  few  iterations  were  used  but  they  suffice  to  reveal  a  very  important  fact.  While 
the  depth  maps  are  virtually  identical  to  the  observer,  the  discontinuity  map  shows  that 
strong  discontinuities  are  established  along  the  regions  of  constant  brightness  on  the  can 
surface.  They  prohibit  information  from  propagating  into  the  regions  of  high  variance.  So 
while  smoothing  is  achieved  the  filling  in  cannot  be  accomplished  with  the  MRF  scheme, 
because  variance  information  is  not  incorporated. 

8.3  Filling  in  high  variance  regions 

One  could  devise  various  schemes  to  incorporate  the  variance  information  into  the  MRF 
approach  in  particular  by  varying  a  and  7  spatially  according  to  the  variance.  This,  however 
is  not  straightforward  and  leads  to  a  significant  complication  of  the  algorithm.  Instead,  we 
found  a  very  simple  procedure  to  be  useful  in  solving  the  problem  of  filling  in  information. 

Shown  in  figure  22  are  the  variance  (not  in  logarithmic  scale)  corresponding  to  the  up¬ 
dated  depth  map  of  figure  19  and  a  histogram  of  these  variances.  We  observe  (this  observa¬ 
tion  holds  for  all  the  variance  maps  we  investigated)  that  the  distribution  is  exponentially- 
shaped  with  the  majority  of  the  values  concentrated  at  the  lower  end.  In  fact,  over  90  % 


Figure  21:  The  line  processes  from  the  pepsi  depth  map  with  a)  7  =  160000,  b)  7  =  250000 
and  c)  7  =  360000  (see  text). 
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Figure  22:  The  variance  map  after  update  and  a  histogram  of  the  variances. 


of  the  values  are  between  the  min  and  min  +a  where  a  is  the  standard  deviation  of  the 
distribution  and  min  is  the  smallest  variance  value  observed.  In  other  words:  "bad”  or  high 
variance  values  can  be  clearly  identified  from  the  distribution. 

Having  determined  such  a  threshold  on  the  variances  we  fill  in  depth  values  that  exhibit 
a  variance  value  above  the  threshold  by  repeating  the  following  ’’brushfire-fUT’  algorithm, 
until  all  variance  values  are  below  the  threshold: 

1.  Initially,  set  =  p,;  and  Z\}  —  ZtJ  for  all  (*,j)  where  Z  and  p  are  the  input  depth 
and  its  variance. 


2. 


For  all  sites  (»,  j)  that  have  variance  value  pij  above  the  threshold  set  the  new  variance 
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Figure  23:  a)  The  result  of  filling  in  the  updated  depth  map  with  threshold  min+<r,  b) 
MRF  smoothing  of  filled  in  depth  map,  c)  line  processes 
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Figure  24:  Line  number  150  of  a)  the  update  depth  map,  b)  the  filled  in  depth  map,  c)  the 
filled  and  MRF-smoothed  depth  map. 

value  j/tj  to  be  the  average  of  the  variances  at  the  eight-connected  neighbors  that  are 
below  the  threshold  and  the  new  depth  to  be  the  average  of  the  depth  values  Z,; 
at  those  sites. 

3.  Copy  pij  —  p[:  and  Zi3  =  Z[3  for  all  Continue  with  step  2. 

The  effect  will  be  that  in  each  step  information  will  propagate  by  one  pixel  from  low 
variance  regions  into  high  variance  regions  until  all  low  variance  information  has  been  filled 
in.  Figure  23  a)  shows  the  result  of  filling  in  when  a  variance  threshold  of  min  +<7  =  255000 
is  used.  The  improvement  over  the  input  depth  map  19  and  all  the  MRF  depth  maps  20 
is  clearly  visible.  Figure  23  b)  and  c)  show  the  result  of  processing  the  filled  in  depth  map 
with  the  MRF  smoother  (Parameters  a  =  4,  0  =  10,  7  =  250000,  u  =  0.01). 

A  more  detailed  illustration  of  the  effect  of  filling  in  and  smoothing  is  provided  by 
examining  values  along  a  cross-section  of  the  depth  map.  Figure  24  a)  shows  the  values 
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along  line  number  150  of  the  updated  depth  map  of  figure  19.  This  line  is  precisely  3/4 
from  the  top  of  the  depth  map.  Figure  24  b)  shows  the  same  line  after  the  filling  operation 
and  figure  24  c)  shows  the  line  after  MRF-smoothing  the  filled-in  depth  map.  We  see 
how  the  filling-in  step  significantly  improves  the  depth  estimate  in  the  center  of  the  line 
corresponding  to  the  can.  The  MRF  smoothing  step  then  attenuates  local  variation  while 
clearly  preserving  the  major  discontinuity. 

This  filling-in  operation  followed  by  MRF  smoothing  as  described  is  available  an  optional 
"add-on”  in  our  implementation  of  the  dynamic  structure  and  motion  estimation  algorithm. 
It  is  not  a  necessary  component.  As  we  have  demonstrated,  however,  it  is  important  to 
take  care  in  performing  filling  or  smoothing  operations  when  the  quality  or  “information 
content”  varies  among  the  field  values  under  consideration. 

8.4  Algorithmic  aspects  of  the  filling/smoothing  stage 

The  filling  stage  requires  at  least  one  value  in  the  input  depth  map  to  have  variance 
above  the  threshold.  In  the  worst  case,  this  depth  value  could  be  located  in  a  corner  of  the 
depth  map  in  which  case  it  would  take  n  —  1  iterations  to  fill  the  entire  depth  map.  In  each 
iteration  n2  sites  are  checked  so  that  the  worst-case  complexity  is  0(n3).  This  scenario, 
however,  is  rather  unrealistic.  In  practice  the  size  of  the  largest  high  variance  region  can 
be  bounded  by  some  small  number  k  and  the  complexity  is  0(n2k).  On  a  parallel  machine 
the  filling  operation  can  be  accomplished  in  0(log(k))  time. 

The  complexity  analysis  of  the  MRF  operation  is  quite  intricate  in  general  since  conver¬ 
gence  becomes  an  issue.  If,  however,  we  limit  ourselves  (as  we  have  done)  to  a  fixed  number 
of  iterations  (say  m)  the  complexity  is  easily  seen  to  be  0(n2m)  on  a  serial  machine  and 
O(m)  on  a  parallel  machine. 
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9  Experiments 


In  the  previous  sections  we  have  seen  the  effect  of  the  individual  modules  that  constitute 
the  dynamic  motion  vision  algorithm.  In  this  section  we  evaluate  the  complete  estimation 
system  on  several  real  image  sequences.  The  goal  of  this  experimental  evaluation  is  to  de¬ 
termine,  where  possible,  the  amount  of  error  in  the  depth  and  motion  estimates  produced 
by  the  algorithm  and  to  investigate  the  amount  of  improvement  that  is  obtained  by  this 
multiframe  algorithm  over  a  two-frame  algorithm.  Since  true  depth  maps  are  rarely  avail¬ 
able  for  real  scenes,  the  depth  estimates  must  be  assessed  in  a  more  qualitative  manner  by 
the  reader. 


Figure  25:  The  wave  experiment  scene 


9.1  The  “wave”  experiment 

To  obtain  measurable  results  from  both  depth  and  motion  estimation  we  used  a  simple 
scene  with  a  planar  surface  parallel  to  the  image  plane.  The  scene  is  shown  in  figure  25. 
The  camera  translated  by  t  =  [1.5, 0,3.0]  mm  between  frames.  Initial  values  of  Z0  —  1000 
and  po  =  10002  were  used.  Figures  26  a)  and  b)  show  3D  structure  plots  of  the  depth 
maps  after  the  1st  and  10th  filter  iteration  respectively.  In  both  cases  the  depth  range 
shown  is  [767, 1499]  which  is  the  range  of  depth  values  after  1  iteration.  No  filling-in  or 
MRF  smoothing  was  used  in  order  to  investigate  the  effect  of  the  filter  alone.  The  noise 
reduction  achieved  by  use  of  the  filter  is  clearly  visible. 

Due  to  the  simplicity  of  the  experiment,  we  can  quantify  the  effect  of  the  filter.  Since 
the  scene  is  a  frontal  plane,  the  ideal  depth  map  contains  constant  values  everywhere.  In 
practice,  the  depth  values  will  deviate  from  the  true  constant  value.  The  effect  of  the  filter 
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Figure  26:  Structure  plots  of  the  depth  maps  from  the  wave  scene  after  a)  1  iteration,  b) 
10  iterations  of  the  filter.  The  depth  range  shown  is  [767, 1499]  in  both  cases. 
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Figure  27:  Statistical  analysis  of  the  depth  maps  from  the  wave  scene  after  a)  1  iteration, 
b)  9  iterations  of  the  filter. 


should  be  to  reduce  the  standard  deviation  of  the  distribution  of  depth  alues  over  time. 
Figure  27  a)  and  b)  show  the  distributions  of  depth  values  after  1  and  9  iterations  respec¬ 
tively,  both  shown  in  the  range  [767, 1499]  for  comparison.  As  we  see  the  expected  variance 
reduction  has  been  achieved.  More  precisely,  the  variance  is  halved  after  9  iterations. 

To  see  the  development  over  time  we  have  plotted  the  variance  of  the  values  in  the 
depth  map  over  the  frame  number  in  figures  28.  The  filter  performs  satisfactorily  and  as 
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Figure  28:  Development  of  variance  in  the  depth  map  as  a  function  of  the  frame  number. 

expected  under  all  evaluation  criteria.  From  this  experiment,  it  is  hard  to  evaluate  the 
convergence  or  robustness  of  the  proposed  method,  since  the  initial  depth  map  happens  to 
be  the  true  depth  map.  The  following  experiment  explores  this  issue. 

9.2  The  “pepsi”  experiment 

In  this  experiment,  a  soda  can  was  placed  on  a  table  at  570  mm,  before  a  background 
parallel  to  the  image  plane  at  1240  mm  as  shown  in  figure  29.  The  camera  translated 
t  =  [1.5, 0,0]  mm  between  frames.  As  we  have  pointed  out  in  previous  sections,  this  scene 
poses  a  number  of  difficulties  since  it  contains  large  regions  of  uniform  brightness  as  well 
as  depth  discontinuities.  Initial  values  of  Zq  =  1000  and  po  =  10002  were  used. 

Figures  30  a),  b)  and  c)  show  the  depth  maps  obtained  after  1,  4,  and  8  iterations  of 
the  algorithm.  In  these  depth  maps,  brightness  is  proportional  to  depth  (close  values  are 
dark,  distant  values  are  bright).  MRF  smoothing  with  3  iterations,  a  =  4,7  =  250000 
and  filling  in  of  high-variance  regions  were  applied  as  described.  These  depth  maps  show 
how  the  estimate  improves  over  time,  how  the  shape  of  the  can  becomes  more  distinct  and 
the  noise  in  the  background  is  attenuated.  Repeated  application  of  the  filling-in  operation 
supplies  regions  of  high  variance  with  plausible  values.  Even  the  slanted  surface  of  the  the 
table  on  which  the  can  stands  is  recovered. 

Figures  31  a)  and  b)  show  the  3d  structure  plots  of  the  depth  maps  after  1  and  8 
iterations.  Note  that  the  bottom  border  of  the  depth  map  corresponds  to  the  top  left 
border  in  the  structure  plot  so  that  the  top  of  the  can  is  visible  to  the  observer.  These  plots 
show  how  the  shape  of  the  scene  gradually  emerges  from  the  noise.  The  filled  in  regions  can 
be  made  out  in  the  final  structure  as  they  still  differ  slightly  from  the  surrounding  areas. 


Figure  29:  The  pepsi  scene 


Figure  32  shows  the  components  of  the  translation  vector  computed  during  the  dynamic 
motion  vision  procedure  for  the  pepsi  scene.  Since  the  true  depth  map  is  not  available,  it  is 
not  possible  to  scale  the  recovered  translation  vector  to  compare  it  with  the  actual  motion. 
Instead,  we  scale  the  recovered  vectors  to  be  unit  vectors  and  determine,  how  far  the 
recovered  motion  deviates  from  pure  x  translation  direction.  We  see  that  the  error  in  all 
components  is  in  the  range  of  2  -  3  %  except  in  the  first  iteration.  This  is  plausible,  since 
the  initial  depth  is  flat  and  differs  considerably  from  the  true  depth.  This  result  for  the 
motion  estimation  confirms  the  robustness  of  the  motion  estimation  stage  with  respect  to 
errors  in  depth. 

Finally  we  use  the  pepsi  experiment  to  compare  the  adaptive  filtering  approach  to  the 
case  of  known  motion.  If  the  true  motion  of  t  =  [1.5, 0,0]  mm  is  known,  the  motion 
estimation  stage  can  be  bypassed  and  the  filter  is  non- adaptive.  For  the  same  parameter 
settings  as  used  to  obtain  the  depth  maps  of  figure  30  the  filter  produces  the  depth  maps 
shown  in  figure  33  a),  b)  and  c)  after  1,  4  and  8  iterations  respectively.  They  do  not  differ 
significantly  from  the  results  of  the  adaptive  filter,  in  fact,  the  adaptive  results  seem  to  be 
“better”. 

9.3  The  “cup”  experiment 

In  this  experiment  a  complex  scene  was  constructed.  A  cup,  a  staple- re  mover  and 
several  books  were  placed  on  a  table  before  a  flat  background.  The  layout  of  the  scene  is 
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Figure  30:  The  depth  maps  from  the  pepsi  scene  after  a)  1  iteration,  b)  4  iterations,  c)  8 
iterations  of  the  algorithm. 

shown  in  figure  34  a)  and  figure  34  b)  shows  an  image  from  the  sequence  taken  by  a  camera 
translating  t  =  [2,0,4]  mm  between  frames. 

Figures  35  a),  b)  and  c)  show  the  depth  maps  computed  by  the  dynamic  motion  vision 
algorithm  after  1,  4  and  8  iterations  respectively.  The  same  initial  values  Zo  =  1000  and 
po  =  10002  as  in  the  previous  experiments  were  used.  MRF  smoothing  with  2  iterations 
and  parameters  a  =  4,  7  =  40000  (edge  threshold  magnitude  100)  and  filling-in  were  in 
effect. 

Figure  36  a)  shows  the  components  of  the  recovered  translation  vector  as  a  function  of 
the  frame  number.  As  before  the  translation  vector  was  normalized  to  be  a  unit  vetor.  The 
true  translation  vector  in  this  diagram  is  therefore  [0.447,0,0.894]. 
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Figure  31:  Structure  plots  for  the  pepsi  scene  after  a)  1  iteration,  b)  8  iterations. 


Finally,  figure  37  shows  the  line  processes  produced  by  the  MRF  smoother  in  the  9th 
iteration  of  the  algorithm. 

In  evaluating  the  experimental  results  we  find  that  the  algorithm  recovers  the  structure 
of  all  elements  in  the  scene.  Even  the  staple- remover  in  the  left  foreground  is  recovered, 
despite  the  fairly  uniform  brightness  in  that  image  region.  The  bottom  region  of  the  cup 
is  not  only  uniform  in  brightness  but  also  contains  a  specularity.  The  filling  algorithm 
identifies  this  region  of  high  variances  and  fills  in  plausible  information.  Now  look  at  the 
top  of  the  cup  in  the  image.  Barely  visible  is  a  white  spoon  which  was  placed  in  the  cup. 
It  appears  clearly  in  the  depth  map.  It  seems  to  have  widened.  This  can  be  explained  by 
the  fact  that  we  are  using  patches  of  9  pixels  to  recover  depth,  so  that  a  certain  growth 
of  small  regions  across  depth  discontinuities  should  be  expected.  The  results  from  this 
scene  illustrate  the  practical  applicability  of  the  approach  and  rival  results  obtainable  from 
stereo. 
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Figure  32:  The  components  of  the  translation  vector  estimate.  The  translation  vector  has 
been  scaled  to  be  a  unit  vector. 

10  Conclusion  and  Future  Work 

We  have  presented  a  multiframe  algorithm  for  the  dense  estimation  of  structure  and 
motion  information.  Key  features  of  this  direct  dynamic  motion  vision  method  are 

1.  A  dense  structure  or  depth  map  is  computed. 

2.  The  parameters  of  rigid  body  motion  are  computed. 

3.  No  restriction  or  assumption  is  made  concerning  the  structure  of  the  scene 

4.  No  restriction  or  assumption  is  made  conderning  the  motion  other  that  the  resulting 
motion  in  the  image  plane  is  limited  to  only  a  few  pixels. 

5.  The  algorithm  operates  iteratively  over  a  sequence  of  arbitrary  length  and  produces 
an  updated  depth  estimate  which  best  incorporates  information  from  all  preceding 
frames  according  to  the  quality. 

6.  The  algorithm  is  based  on  the  Kalman  filter  recursive  estimation  technique  and  be¬ 
comes  adaptive  when  motion  estim  ition  is  included. 

7.  Using  a  filling  and  smoothing  algorithm,  plausible  structure  estimates  can  be  obtained 
even  in  image  regions  where  no  information  is  available. 


Figure  33:  The  depth  maps  from  the  pepsi  scene  after  a)  1  iteration,  b)  4  iterations,  c)  8 
iterations  of  the  algorithm  when  motion  is  known. 

8.  No  time-consuming  computation  of  optical  flow  or  feature  matches  is  required.  The 
algorithm  has  great  potential  for  real-time  implementation  as  all  of  the  operations 
involved  are  local. 

9.  The  algorithm  provides  robust  results  for  both  structure  and  motion  as  shown  on  a 
variety  real  images. 

The  key  assumption  we  have  made  is  the  brightness  change  constraint  assumption  (7). 
We  have  in  fact  shown,  that  its  validity  at  individual  pixels  is  not  guaranteed  at  all,  but 
that  reasonable  information  may  be  obtained  from  this  assumption  when  image  patches  are 
considered.  In  combination  with  the  recursive  estimation  procedure  for  error  attentuation 


Figure  34:  a)  The  layout  of  the  cup  experiment  and  b)  an  image  from  the  cup  sequence. 
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Figure  35:  a)  The  depth  maps  computed  by  the  algorithm  for  the  cup  scene  after  a)  1 
iteration,  b)  4  iterations,  c)  9  iterations. 
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Figure  36:  U,  V  and  W  components  of  the  recovered  translation  vectors  as  functions  of  the 
frame  number.  Translation  vectors  have  been  normalized  to  unit  vectors.  True  unit  motion 
is  [0.447,  0,  0.894]. 


Figure  37:  The  line  processes  produced  by  the  MRF  smoother  in  the  last  iteration  of  the 
dynamic  structure  estimation  algorithm. 

over  multiple  frames  the  experimental  results  seem  to  confirm  the  validity  of  this  assumption 
in  the  given  application. 

A  limitation  is  the  fact  that  only  one  independently  moving  rigid  body  may  be  visible 
in  the  scene.  To  overcome  this,  it  would  be  necessary  to  segment  the  image  into  regions 
corresponding  to  independently  moving  objects  and  then  run  the  algorithm  on  each  region 
individually.  An  idea  that  we  are  currently  pursuing  is  to  integrate  segmentation  into  the 


dynamic  estimation  process.  The  line  processes  available  from  the  MRF  smoothers  are  a 
first  step  in  this  direction,  but  more  work  is  necessary. 

An  alternative  to  the  proposed  iterative  method  is  a  global  or  batch  approach  which 
collects  information  from  the  entire  sequence  before  estimating  motion  and  structure.  As 
we  have  mentioned,  such  methods  should  exhibit  higher  accuracy  and  should  be  evaluated 
in  comparison  with  the  presented  approach. 

The  most  interesting  aspect  for  future  work  lies  in  the  following  observation.  The  depth 
update  procedure  which  incorporates  information  from  a  new  measurement  reads  (eq.  46) 

a  Zk/Pk  4-  Zk/Pk 

k  1/Pfc  +  1  /Pk 

Suppose  measurements  of  depth  are  available  from  several  independent  sources  such  as 
stereo,  shape  from  shading,  motion  or  even  non-visual  sensors.  If  each  such  measurement  Zi 
is  accompanied  by  an  estimate  of  its  quality  (variance)  p,,  then  they  can  all  be  incorporated 
into  the  current  depth  estimate  in  a  straightforward  way 

n 

Zk/Pk  +  T.Zik/Pik 

Zk - *=? -  (98) 

1/Pfc  +  £l/p.* 
i-1 

This  is  the  basis  for  a  theory  of  temporal  integration  which  we  intend  to  investigate  by 
combining  depth  information  from  various  visual  sources  over  time  which  has  great  potential 
for  future  work. 


(97) 
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A  Appendix:  Warping  the  variance  values  in  the  prediction  stage 

It  remains  to  determine  the  variance  values  of  the  warped  depth  map.  In  essence,  the 
warped  values  of  Z  are  some  function  of  the  input  values  of  Z  i.e.  the  output  value  is  a 
random  variable  which  is  some  function  of  several  input  random  variables.  More  formally: 
we  interpret  the  given  values  of  Z(xj,y,)  as  normally  distributed  random  variables  with 
variance  p(xj, y,).  What  are  the  value  of  p(ij, y,)  after  the  warping? 

A.l  Variance  Propagation 

Let  us  first  establish  some  basic  facts  about  propagation  of  variances.  Let  Zi,Z2  be 
two  uncorrelated  random  variables  with  variances  Then  the  random  variable 

Z  =  aZt  +  bZ2  (99) 

has  the  variance 

a\  =  a2cr2Zl  +  62cr|2  (100) 

The  only  assumption  made  here  is  that  Z\ ,  Z2  are  uncorrelated. 

In  the  more  general  case  Z  can  be  an  arbitrary  function  of  several  random  variables, 
say 

Z  =  f(ZuZ2).  (101) 

In  this  case  we  approximate  /  by  its  Taylor  series  around  the  point  of  interest  and  neglect 
all  but  the  first-order  terms.  Then  we  apply  the  above  rule  for  a  linear  combination  of 
random  variables  to  obtain 

^  =  (^)2^,+(^)2<T^  (102) 

where  the  derivatives  must  be  evaluated  at  the  particular  point  (Z\,Z2)  of  interest.  This 
relationship  is  easily  extended  to  n  independent  variables.  The  assumptions  made  here  are 
zero  correlation  between  Zi  and  Z2  and  the  fact  that  /()  can  be  approximated  by  the  first 
terms  of  its  Taylor  series  near  Z2,Z2. 

A. 2  Variances  of  the  warped  depth  values 

Having  established  how  variances  propagate  through  functions  we  need  merely  deter¬ 
mine  the  functional  relationship  between  input  and  output  depth  values  and  propagate  the 
variances  through  these  functions.  Let  us  determine  where  in  our  algorithm  we  actually 
compute  an  output  value  of  Z  and  how  it  is  computed. 

There  are  two  ways  in  which  an  output  value  of  Z  can  be  computed.  The  first  is  in  step 
6  of  the  algorithm.  There  we  compute  the  intersection  of  the  ray  through  a  pixel  location 
(x,y)  with  a  spatial  triangle.  The  spatial  triangle  is  determined  by  the  3D  coordinates  of 
the  corner  points  (X',  Y-,  Z()  for  »  =  1,2,3.  So  our  output  value  of  Z  is  some  function 
(which  is  determined  by  the  interpolation  procedure)  of  the  corner  coordinate  values.  Each 
corner  coordinate  is  the  result  of  warping  one  point  (Xi,  Yi,  Zi)  from  the  input  surface.  The 
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warping  function  is  simply  a  linear  combination  of  the  input  point  coordinates.  Finally  we 
note  that  the  X  and  Y  components  of  each  original  point  are  obtained  by  inverse  perspective 
projection 

=  and  Yi  =  Vji.  (103) 

Thus,  each  Z  value  obtained  in  step  6  of  the  algorithm  is  a  function  of  three  depth  values 
Zi,  Zj,  Z3  in  the  input  depth  map.  We  determine  this  functional  relationship  below. 

The  second  possibility  is  when  the  depth  value  is  obtained  by  extrapolating  in  step  7 
of  the  algorithm.  In  this  case  a  depth  value  Z  is  computed  as  the  average  of  some  depth 
values  Zi, . . .  Z*  in  its  immediate  neighborhood.  This  is  a  fairly  simple  linear  relationship. 
The  variance  propagation  for  this  case  is  also  discussed  in  detail  below. 


A.2.1  Variances  of  interpolated  depth  values 

We  determined  above  that  each  output  depth  value  Z  is  a  function  of  three  input  depth 
values  Z\ ,  Z2,  Z3.  Let  us  begin  by  determining  the  relationship  between  these  values  and 
the  corresponding  values  (X[,Y{,Z[),  (X^Y^Z^),  (X^Y^Zj)  after  warping.  We  combine 
the  equations  of  motion 

B 

-A 
1 

(105) 

(106) 

(107) 

(108) 

Z[  =  -W  +  bziZ,  (109) 

where 

bxi  =  *//  +  Cy/ f  -  B  (110) 

by  i  =  Cx/f  +  y/f+  A  (111) 

bzi  =  Bx/f-Ay/f+  1.  (112) 

After  the  motion  warping,  we  compute  the  output  depth  by  interpolation.  As  we  recall, 
there  are  three  cases  by  which  Z  can  be  obtained  within  the  interpolation  procedure. 

1.  There  is  exactly  one  point  of  intersection  between  the  ray  and  the  spatial  triangle. 
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2.  The  ray  lies  in  the  same  plane  as  the  spatial  triangle  and  it  has  at  most  one  point  in 
common  with  each  edge  of  the  spatial  triangle. 

3.  The  ray  lies  in  the  same  plane  as  the  spatial  triangle  and  it  coincides  with  one  edge 
of  the  spatial  triangle. 

In  the  first  case,  the  output  Z  value  is  computed  as 

Z  =  (113) 

where 

D  =  (Xx  -  X3)[(Z2  -  Z3)y  -  (Y2  -  y3)/]  -  (114) 

(Yx  -  Y3)[(Z2  -  Z3)x  -  (X2  -  X3)f ]  + 
(Zx-Z3)[(Y2-Y3)x-(X2-X3)y] 

and 

Dx  =  -X3{(Yx-Y3)(Z2-Z3)-{Y2-Y3){Zx-Z3)  ]  (115) 

+Y3[(X x  -  X3)(Z2  -  Z3)  -  (X2  -  X3)(Zx  -  Z3)} 

~z3[(Xi  -  x3)(Y2  -  Y3)  -  (X2  -  x3)(Yx  -  y3)] 

where  the  primes  have  been  omitted.  If  we  use  the  above  equations  (107)  -  (109)  to  replace 
the  variables  in  these  expressions,  we  find 


D  —  aZ\Z2  +  bZ\Z3  +  cZ2Z3 

(116) 

where 

a  =  bxl(bz2y  -  by2f)  -  byl(bt2x  -  bx2f)  +  bzl{by2x  -  bs2y) 

(117) 

b  -  bxl(bv3f  -  bz3y)  -  byl{bx3f  -  bz3x)  +  bti(bx3y  -  by3x) 

(118) 

c  =  -bx3(bz2y  -  by2f)  +  by3(bz2x  -  bx2f )  -  bz3{by2x  -  bx2y). 

(119) 

Further  we  have 

D\  —  aZ\Z2  +  bZ\Z3  +  cZ2Z3  +  dZ\Z2Z3  +  e.Z\Z\  +  fZ2Z$ 

(120) 

where 

a  —  Uax  -  Va2  +  Wa3 

(121) 

b  =  Ubl-Vb2  +  Wb3 

(122) 

c  =  Ucx  -  Vc2  +  Wc3 

(123) 

d  =  —bx3a\  +  by3a2  —  bz3a3 

(124) 

e  =  -bx3b\  +  by3b2  -  bz3b3 

(125) 

f  =  -bX3Ci  +  by3c2  -  bz3c3 

(126) 
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in  which  we  have  abbreviated 


* 


<*1 

= 

by\bz2  -  by2bzl 

(127) 

bi 

= 

bZlby3  -  bylbz3 

(128) 

Cl 

= 

by2bz3  -  by3bz2 

(129) 

(130) 

a2 

= 

bxibz2  -  bx2bti 

h 

= 

bX3bzi  -  bxibZ3 

(131) 

c2 

= 

bx2bz3  -  bx3bz2 

(132) 

(133) 

03 

= 

bx\by2  -  bx2by\ 

bi 

= 

bX3byi  -  bx\by3 

(134) 

C3 

= 

bx2by3  -  by2bx3. 

(135) 

As  a  result  of  these  tedious  manipulations  we  are  now  able  to  express  an  interpolated 
value  of  Z  as  a  function  of  three  input  depth  values: 

Z -*(*.*.*) -/^^  (136) 


If  the  variances  for  the  input  depth  values  are  pi,P2,P3,  we  assume  that  they  are  uncorre¬ 
lated  and  the  above  functional  relationship  can  be  approximated  by  the  first  terms  of  the 
Taylor  series  we  can  use  the  variance  propagation  (102) 


,  dZ  ,  dZ  .2  ,  ,  dZ  .  2 

p  =  (3zT)f"  +  (5^)P2  +  Ww 


(137) 


to  determine  the  output  variance  p.  It  remains  to  determine  the  partial  derivatives 


dZ  .  d  D\  -  D\§z 

dZt  1  dZi  D  D2 


The  partial  derivatives  of  D  are  obtained  easily  from  (116) 


dD 

dZx 

dD 

dZ2 

dD 

dZ3 


aZ2  +  bZ3 
o.Z\  +  cZ3 
bZ\  +  cZ2 


and  similarly  those  for  D\  from  (120) 

dDx 

dZi 
dDx 

dz2 
dDx 
dZ3 


aZ2  4-  bZ3  +  dZ2Z3  -f-  tZ\ 
aZi  +  cZ3  +  dZ\  Z3  +  fZ3 
bZ\  +  cZ2  ■+■  dZ\Z2  +  2eZ\Z3  +  2fZ2Z3. 


(138) 

(139) 

(140) 

(141) 

(142) 

(143) 

(144) 
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Note  that  a,  b  and  c  are  computed  differently  for  D  and  D\.  We  must  also  take  care  not 
to  confuse  the  coefficient  /  used  in  the  expression  for  D\  with  the  focal  length  /.  This 
result  enables  us  to  compute  the  variance  p  of  every  output  depth  value  Z  obtained  by 
interpolation  in  the  case  where  the  ray  has  exactly  one  point  in  common  with  the  spatial 
triangle. 

In  the  second  case  listed  above  we  consider  how  a  depth  value  is  obtained  if  the  in¬ 
terpolation  ray  is  parallel  to  the  spatial  triangle  but  does  not  coincide  with  one  of  its 
edge  segments.  In  this  case  we  recall  that  Z  is  obtained  by  interpolating  between  the  two 
endpoints  (X\,Yi,Zi)  and  (X2,Y2,  Z2)  of  the  triangle  edge  segment: 

Z  =  /y  (145) 

with 

d  =  x(Yl-Y2)-y(Xl-X2)  (146) 

and 

dx  =  (Xt  -  X2)Y2  -  (Y2  -  Y2)X2.  (147) 

All  coordinate  components  are  after  the  motion  warping  although  the  primes  have  been 
omitted.  We  express  both  d  and  d\  in  terms  of  the  depth  values  of  the  endpoints  Z\  and 
Z2  before  motion  warping  as  done  before: 


d  —  aZ\  4-  bZ2 

(148) 

where 

0  —  xbyi  ybxi 

(149) 

b  =  ybx  2  —  xby2 

(150) 

and  similarly 

d\  —  aZ\  4*  bZ2  4-  cZ\Z2 

(151) 

in  which 

a  =  bylU-bxlV 

(152) 

b  =  bx2V-by2U 

(153) 

c  =  bx\by2  —  byibx2. 

(154) 

We  compute  the  partial  derivatives 

dd 

dZi  ~  a 

(155) 

-O 

II 

*15 

(156) 
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and 


a  +  cZi 


(157) 

(158) 


otherwise 

p  =  b]2p2.  (164) 

From  the  point  of  view  of  implementation  we  see  that  the  computation  of  variances 
cannot  simply  be  added  to  the  depth  computation  outlined  in  the  previous  section.  A 
completely  different  approach  is  necessary  in  order  to  have  the  6r,-,hy,,6„  values  available 
in  the  interpolation  stage.  Although  the  depth  computation  is  identical  the  abstraction 
from  the  underlying  geometry  makes  it  less  intuitive. 


A.2.2  Variances  of  extrapolated  depth  values 

A  number  of  output  depth  values  are  obtained  through  extrapolation.  As  described 
above  we  use  a  very  simple  extrapolation  procedure.  In  which  an  unassigned  depth  value  is 
set  to  the  average  of  its  assigned  neighbors.  This  extrapolation  was  motivated  by  assuming 
certain  smoothness  of  the  surface.  Nevertheless,  the  choice  of  variance  at  this  location 
should  indicate  the  fact,  that  no  information  was  available  there  so  that  subsequent  up¬ 
dates  can  contributed  maximally  to  the  value  at  this  point.  This  is  particularly  important 
when  objects  enter  the  field  of  view  that  exhibit  a  depth  discontinuity  with  respect  to  the 
previously  visible  objects.  We  therefore  assign  oo  (some  large  value)  as  the  variance  of 
extrapolated  points. 
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